How To Get Started With Spark and Scala

Introduction

I wanted to write a brief tutorial on how to get set up building Spark apps using Scala in my local environment because I have a couple of posts I want to write that depend on this setup.   My hope is that this will be straightforward, and that by the end anyone will be able to build and run all the Spark Examples at  https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

Outline

The process that we are going to follow is below:

  1. Install Scala Build Tool (SBT)
  2. Install Scala IDE (IntelliJ)
    1. Install Scala Plugin for IntelliJ
  3. Build and Run the Pi Example
  4. Build and Run the Spark Examples

Warning:   This tutorial was originally written in Nov 2016.   The rate of change of both Spark and IntelliJ may make these instructions obsolete sooner rather than later.  I provide links to sources wherever possible.   Post a comment if you do find something obsolete and I will attempt to fix it.

Install Scala Build Tool

The instructions for installing SBT 0.13.5 are found here.  Unfortunately the link uses the explicit version, and not ‘latest’, so you may want to click around until you find the latest instructions.

The Mac instructions and Ubuntu instructions are straightforward.

## NO BREW!?!?!?!?!?
## /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
#Mac Install
brew install sbt

#Ubuntu Install
echo "deb https://dl.bintray.com/sbt/debian /"
sudo tee -a /etc/apt/sources.list.d/sbt.list sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt

Install Scala IDE

Strictly speaking, you do not need an IDE for Scala, but I highly recommend one.  If you are new to Scala, and have built any Spark applications in python or R, you will likely get caught by Scala being strongly typed.   An IDE like IntelliJ will make the types explicit for you, and catch mistakes as you make them.

Scala IDE Options

There are two options that are worth considering, per Quora and Reddit:

  1. Eclipse with Scala Plugin: http://scala-ide.org/
  2. IntelliJ (Community Edition): https://www.jetbrains.com/idea/download/ 

I use IntelliJ, and if you choose this route you will need to install the SBT plugin.

  1. Open Preferences
  2. Click on the ‘Plugins’ tab on the right hand side
  3. Search for ‘SBT’ or ‘Scala’ in the search panel on the right hand side
  4. Click ‘Install’

sbt-plugin.png

Full instructions for installing plugins are found here : https://www.jetbrains.com/help/idea/2016.2/installing-updating-and-uninstalling-repository-plugins.html

Build and Run Pi Example

Setup

The first thing we need to do is set up a project.  I have made a template and it is hosted on github.

git clone https://github.com/bryantravissmith/Spark-Scala-Template.git

Using IntelliJ, you can import the existing project after you have cloned it.  Open IntelliJ and create a project from an existing source.

existing-source-intellij

Select the cloned location and import it.

select-source-intellij.png

Choose import SBT for the existing project.

choose-sbt-intellij.png

Click Next, choose your preferences, and click Finish.  The project will take a little time to import and download supporting files.   After it is done you should see an open project with the following structure.

import-complete-intellij.png

Review build.sbt Structure

If you are new to Scala, I think you might be curious about this build.sbt file.  It sets up our project and describes the libraries we want to include.   Here is the file with some additional comments:


//Defines the version of our application and the scala version
//Change scala version for your project
// Spark 2+ Scala 2.11, Spark 1.6 Scala 2.10
val globalSettings = Seq(
  version := "1.0",
  scalaVersion := "2.11.8"
)

//Defines locations to look for dependencies
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Repo at github.com/ankurdave/maven-repo" at "https://github.com/ankurdave/maven-repo/raw/master"

//Defines project structure.
lazy val sparkScalaTemplate = (project in file("."))
 .settings(name := "SparkAppName-ToBeReplaced")
 .settings(globalSettings:_*)
 .settings(libraryDependencies ++= dependencies)

//Defines the version of spark we are going to use
val sparkVersion = "2.0.1"

//Lists the dependencies
lazy val dependencies = Seq(
 "org.apache.spark" %% "spark-core" % sparkVersion,
  // %% means auto-include the Scala version
  // "org.apache.spark" % "spark-mllib_2.11" % sparkVersion,
 "org.apache.spark" %% "spark-mllib" % sparkVersion,
 "org.apache.spark" %% "spark-graphx" % sparkVersion,
 "org.apache.spark" %% "spark-sql" % sparkVersion
)

//Defines how to include jars in the UberJar
assemblyMergeStrategy in assembly := {
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}

The first thing you are likely to want to do is add additional libraries to integrate with your project or with your production system.  Maven is going to be the best first source for these libraries.   You can see the different Spark projects here:  https://mvnrepository.com/artifact/org.apache.spark

screen-shot-2016-11-13-at-8-27-30-am

You can see the structure that is in the build.sbt file reflected here.  If we click on the first link, https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10, we see the following (forgive the AMEX ad).

Screen Shot 2016-11-13 at 8.31.18 AM.png

You can read off the build.sbt format from this:

Group:  org.apache.spark
Artifact: spark-core_2.10
Version: 2.0.1

"org.apache.spark" %% "spark-core" % "2.0.1"
"org.apache.spark" % "spark-score_2.11" % "2.0.1"

Aside – Including a new dependency

Let’s say I want my Spark app to write to a PostgreSQL 9.4 database.  I would start off by searching Maven for ‘postgresql’.  It quickly takes me here: https://mvnrepository.com/artifact/org.postgresql/postgresql

screen-shot-2016-11-13-at-8-42-50-am

I can now add the dependency to the build.sbt:

Group: org.postgresql
Artifact: postgresql //Notice no scala version
Version: 9.4.1211.jre7

lazy val dependencies = Seq(
 "org.apache.spark" %% "spark-core" % sparkVersion,
 "org.apache.spark" %% "spark-mllib" % sparkVersion,
 "org.apache.spark" %% "spark-graphx" % sparkVersion,
 "org.apache.spark" %% "spark-sql" % sparkVersion,
 "org.postgresql" % "postgresql" % "9.4.1211.jre7"
)

To get the documentation on how to use this library I would do a Google search for ‘org.postgresql javadocs’.  The first result is https://jdbc.postgresql.org/documentation/publicapi/.  I will let you engage in the thrill of discovery in how to implement this.

Spark PI

Now that we have a project, let’s quickly get to implementing and running the Pi example.  I have included it in the github template with a minor addition.

package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .master("local[*]") //Added from originals for local run in IntelliJ
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(1000000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}

Run

With SparkPi.scala open in the IntelliJ IDE you can ‘Run’ the app.

Screen Shot 2016-11-13 at 12.49.59 PM.png

The output is below:

Pi is roughly 3.1416575708287855

Build and Run any Spark Example

So now I’m going to work through pulling all of the Spark Examples so you can build them and play with them yourself.

First we are going to clone a portion of the repo, and restructure the files to fit into the IntelliJ project format.

mkdir SparkExamples
cd SparkExamples
git init
git remote add -f origin https://github.com/apache/spark.git
git config core.sparseCheckout true
echo "data/*" >> .git/info/sparse-checkout
echo "examples/src/main/scala/*" >> .git/info/sparse-checkout
git pull origin master
mv examples/* ./
rm -r examples
rm -r src/main/scala/org/apache/spark/examples/streaming/

Now we want to create a new Scala-SBT  project in this directory.

Screen Shot 2016-11-13 at 1.56.29 PM.png

screen-shot-2016-11-13-at-2-01-49-pm

Now you can replace the build.sbt with the below code.


val globalSettings = Seq(
  version := "1.0",
  scalaVersion := "2.11.8"
)

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Repo at github.com/ankurdave/maven-repo" at "https://github.com/ankurdave/maven-repo/raw/master"

lazy val sparkScalaTemplate = (project in file("."))
  .settings(name := "SparkExamples")
  .settings(globalSettings:_*)
  .settings(libraryDependencies ++= dependencies)

val sparkVersion = "2.0.1"

lazy val dependencies = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark"    %% "spark-sql"             % sparkVersion,
  "org.apache.spark"    %% "spark-mllib"           % sparkVersion,
  "org.apache.spark"    %% "spark-graphx"          % sparkVersion,
  "com.github.scopt" %% "scopt" % "3.5.0"
)

When you save this, IntelliJ will ask if you want to auto-import and/or refresh.  I set it to do this automatically.  There may be some red warnings.  Ignore them for a moment.

Go ahead and open SparkKMeans.scala and run it.

I get a few errors having to do with the examples using previous versions of the Spark API.  There are only three affected files.

LogisticRegressionWithElasticNetExample.scala

comment out .setFamily("multinomial") on line 54
change coefficientMatrix to coefficients on line 59
change interceptMatrix to intercept on line 60

MulticlassLogisticRegressionWithElasticNetExample.scala

change coefficientMatrix to coefficients on line 50
change interceptMatrix to intercept on line 51

SparkSQLExample.scala

change df.createGlobalTempView("people") to df.createTempView("people")

Now when you try to run SparkKMeans.scala you will get the following error because we have not supplied arguments.

Usage: SparkKMeans <file> <k> <convergeDist>

Process finished with exit code 1

In the top right of the IDE we can edit and change the configurations.

screen-shot-2016-11-13-at-2-36-15-pm

And add the following arguments

data/mllib/kmeans_data.txt 2 1.0

screen-shot-2016-11-13-at-2-38-31-pm

The last thing we need to do is add a master("local[*]") to the spark session builder.

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("SparkKMeans")
  .getOrCreate()

Now we can run the example and get the following output.

Final centers:
DenseVector(0.2, 0.2, 0.2)
DenseVector(0.1, 0.1, 0.1)

Conclusion

At this point you should be able to run all the Spark Examples with some minor edits.  All the examples will need master("local[*]") added when building the spark session.  Some of the examples will require that you provide arguments.

Generating Cash Flow Expectations For Lending Club in R

OC R User Group Meet-Up

This is a summary of the presentation I gave at the October OC R User Group Meet Up. The focus of the talk was to solve the following problem statement using R.

I am investing in Lending Club notes. Specifically Grade A, 36 Month loans. I want to generate an expected cash flow for my investment.

Note

Sometimes wordpress.com will format code, like the pipes and R assignment operators, incorrectly.  You can find the code for this post here:  https://github.com/bryantravissmith/OC-RUG

Setup

I want to generate my expectations for grade A, 36 month loans. The main goal is to build a framework for the solution so that it can also be applied to other loans. I am starting off by creating a data frame from all the joined Lending Club data as of June 30th, 2016, and selecting the target loans from tranches of loans that are mature.

library(dplyr)
library(reshape2)
library(ggplot2)
library(zoo)
library(stringr)
library(knitr)
library(DT)

# Previously loaded all LC data into a single data frame and saved as RData
load(&quot;/opt/science/Datasets/LendingClub/2016-06-30/lcCombinded.RData&quot;)

lcA &lt;- lc %&gt;% mutate(
    #Crate a Data Object for the Issue Date
    issue_yearmon = as.Date(as.yearmon(issue_d,format=&quot;%b-%Y&quot;)),
    #Create a Numeric Issue Year For Group By Calculations
    issue_year = as.numeric(as.character(issue_yearmon,format=&quot;%Y&quot;)),
    #Convert to Date object to calculate the 'Age' of the loan
    last_pymnt_yearmon = as.Date(as.yearmon(last_pymnt_d,format=&quot;%b-%Y&quot;)),
    #Converte interest rate to ta numeric value
    interest = as.numeric(str_replace(int_rate,&quot;%&quot;,&quot;&quot;))
  ) %&gt;%
  mutate(
    #Make what are roughly 1 month wide in the time between the loan was
    #originated and the time it stopped generating cash flows
    AgeBucket = cut(as.numeric((last_pymnt_yearmon-issue_yearmon)/365),
                         breaks=seq(0,6,1/12),
                         include.lowest=T,
                         right=F)
    ) %&gt;%
  mutate(
    #Converte the Age Bucket to a Numeric Month or Statement value
    Age = match(AgeBucket,levels(AgeBucket))
    ) %&gt;%
  mutate(
    #Get the total recieved payments for each loan
    total_rec = total_rec_prncp+total_rec_int
    ) %&gt;%
  #Limit to the data we are interested in for the problem
  #Remove 2007 - stands out in graphs, actually doesn't change the results too much.
  filter(term == &quot; 36 months&quot;,grade == &quot;A&quot;, issue_year  &lt; 2013, issue_year &gt; 2007)

Cash Flow Expectation : Paid As Agreed

The simplest expectation is to just assume that everyone will make every scheduled payment. The equation for the payment of a fixed term loan is:

Payment = \frac{LoanAmount * MonthlyInterestRate}{1 - \frac{1}{(1+MonthlyInterestRate)^{Term}}}
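
As a quick sanity check with the numbers used in the code and table below (a 7.8% annual rate and a 36 month term), the payment per dollar of principal is

Payment = \frac{1 \times 0.078/12}{1 - (1+0.078/12)^{-36}} = \frac{0.0065}{1 - 0.7920} \approx 0.0312

which matches the 0.0312 payment column in the amortization schedule below.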

You can then use this to get the amortization schedule for the loan.

# Generates the expected payment for a fixed term loan.
payment_value &lt;- function(loan_amount,interest_rate,term_number) {
  loan_amount*interest_rate/12/(1 - (1+interest_rate/12)^(-1*term_number))
}

#Gereates the scheduled payment behavior for a loan that is paid as agreed
get_amortization &lt;- function(term_number,interest_rate){
  amortization &lt;- data.frame(statement = seq(0,term_number,1),
                             payment = c(0,rep(payment_value(1,interest_rate,term_number),term_number)),
                             interest_payment = 0,
                             principal_payment = 0,
                             start_balance = 0,
                             end_balance = 1)
  for(i in (1:term_number+1)){
    amortization[i,'start_balance'] = amortization[i-1,'end_balance']*(1+interest_rate/12)
    amortization[i,'interest_payment'] = interest_rate*amortization[i-1,'end_balance']/12
    amortization[i,'principal_payment'] = amortization[i,'payment'] - amortization[i,'interest_payment']
    amortization[i,'end_balance'] = amortization[i,'start_balance']-amortization[i,'payment']
  }
  amortization[term_number+1,'end_balance']=0
  return(amortization)
}

#Helper function to round numeric values when writing to data table
round_df &lt;- function(df, digits) {
  nums &lt;- vapply(df, is.numeric, FUN.VALUE = logical(1))
  df[,nums] &lt;- round(df[,nums], digits = digits)   (df) } #Printing values get_amortization(36,0.078) %&gt;%
  round_df(4) %&gt;%
  kable(format = &quot;markdown&quot;)
statement payment interest_payment principal_payment start_balance end_balance
0 0.0000 0.0000 0.0000 0.0000 1.0000
1 0.0312 0.0065 0.0247 1.0065 0.9753
2 0.0312 0.0063 0.0249 0.9816 0.9504
3 0.0312 0.0062 0.0251 0.9565 0.9253
4 0.0312 0.0060 0.0252 0.9313 0.9001
5 0.0312 0.0059 0.0254 0.9059 0.8747
6 0.0312 0.0057 0.0256 0.8803 0.8491
7 0.0312 0.0055 0.0257 0.8546 0.8234
8 0.0312 0.0054 0.0259 0.8287 0.7975
9 0.0312 0.0052 0.0261 0.8027 0.7714
10 0.0312 0.0050 0.0262 0.7764 0.7452
11 0.0312 0.0048 0.0264 0.7500 0.7188
12 0.0312 0.0047 0.0266 0.7235 0.6922
13 0.0312 0.0045 0.0267 0.6967 0.6655
14 0.0312 0.0043 0.0269 0.6698 0.6386
15 0.0312 0.0042 0.0271 0.6427 0.6115
16 0.0312 0.0040 0.0273 0.6154 0.5842
17 0.0312 0.0038 0.0274 0.5880 0.5567
18 0.0312 0.0036 0.0276 0.5604 0.5291
19 0.0312 0.0034 0.0278 0.5326 0.5013
20 0.0312 0.0033 0.0280 0.5046 0.4733
21 0.0312 0.0031 0.0282 0.4764 0.4452
22 0.0312 0.0029 0.0284 0.4481 0.4168
23 0.0312 0.0027 0.0285 0.4195 0.3883
24 0.0312 0.0025 0.0287 0.3908 0.3596
25 0.0312 0.0023 0.0289 0.3619 0.3307
26 0.0312 0.0021 0.0291 0.3328 0.3016
27 0.0312 0.0020 0.0293 0.3035 0.2723
28 0.0312 0.0018 0.0295 0.2740 0.2428
29 0.0312 0.0016 0.0297 0.2444 0.2131
30 0.0312 0.0014 0.0299 0.2145 0.1833
31 0.0312 0.0012 0.0301 0.1845 0.1532
32 0.0312 0.0010 0.0302 0.1542 0.1230
33 0.0312 0.0008 0.0304 0.1238 0.0925
34 0.0312 0.0006 0.0306 0.0931 0.0619
35 0.0312 0.0004 0.0308 0.0623 0.0310
36 0.0312 0.0002 0.0310 0.0312 0.0000

Results: Paid As Agreed

I am now going to use the Lending Club data to calculate the actual yield of grade A, 36 month loans and compare it to the expectations generated by the above payment schedule. I will do this calculation by year for each year of mature loans in Lending Club. We are assuming there is no prepayment and charge off behavior, so I will also calculate these to see the scale of these behaviors.

lcA %&gt;%
  # Perform the following calculation for each year
  group_by(issue_year) %&gt;%
  # Aggregrate for each year
  summarise(
    #Number of Loans
    Count = n(),
    #Dollars Loaned Out
    DollarsFunded = sum(funded_amnt),
    #Dollars collected from payments
    DollarsRecieved = sum(total_rec),
    #Percent Charge Offs
    Chargeoffs = round(sum(ifelse(grepl(&quot;Charged Off&quot;,loan_status),1,0))/Count*100,1),
    #Percent Prepayments
    Prepays = round(sum(ifelse(grepl(&quot;Fully Paid&quot;,loan_status)&amp;(Age &lt; 36),1,0))/Count*100,1),     #The Percent Increase on the Dollars Loans     Yield = round(100*(DollarsRecieved/DollarsFunded-1),1),     #Dollar weighted interest rate for each year             Interest = round(weighted.mean(interest,funded_amnt),1)) %&gt;%
  mutate(
    #Applys the amortization schedule and sumes payments recieved for each loan
    ExpectedYield = apply( (.),
                             1,
                             function(x) sum(get_amortization(36,
                                                              as.numeric(x['Interest'])/100)$payment))) %&gt;%
  mutate(
    #Rounds Expected Yield to a percent
    ExpectedYield = round(100*(ExpectedYield-1),1)) %&gt;%
  #Filter to relevant variables
  select(issue_year,DollarsFunded,Prepays,Chargeoffs,Interest,Yield,ExpectedYield) %&gt;%
  #Convert to % for display
  mutate(Prepays=paste0(Prepays,&quot;%&quot;),
         Chargeoffs=paste0(Chargeoffs,&quot;%&quot;),
         Interest=paste0(Interest,&quot;%&quot;),
         Yield=paste0(Yield,&quot;%&quot;),
         ExpectedYield = paste0(ExpectedYield,&quot;%&quot;)) %&gt;%
  #Create a data table
  kable( format = &quot;markdown&quot;)
issue_year DollarsFunded Prepays Chargeoffs Interest Yield ExpectedYield
2008 1899800 49.4% 6% 8.5% 7.6% 13.6%
2009 8700550 49.7% 6.7% 8.8% 7.6% 14.1%
2010 20650050 53.1% 4.4% 7.2% 7.4% 11.5%
2011 49784200 50.9% 6.4% 7.2% 6.4% 11.5%
2012 119545400 52.4% 7.2% 7.7% 6.4% 12.3%

Review: Paid As Agreed

I can see that there are significant prepayments and charge offs for each year of Lending Club, so I’ll try to include these in the expectations.

Cash Flow Expectation : Account For Charge Offs

I am going to include charge offs in the payment expectations. Once a loan charges off, it no longer makes payments. This makes it straightforward to include in the amortization schedule because we only need to keep track of how many loans are left, and add the payments of the leftover loans. In reality there is also delinquency behavior that should be accounted for, but we will have to do that another time.

To execute this, I need to get a feel for when the loans charge off in a given tranche of loans.

# Charge Offs
lcA %>%
  #Perform the following calculations for each year
  group_by(issue_year) %>%
  ## Count the number of loans each year
  mutate(count=n()) %>%
  ## Perform a summary for each year and age of loan
  group_by(issue_year,Age) %>%
  ## Find the number of loans that charge off each month
  summarise(bad = sum(ifelse(grepl("Charged Off",loan_status),1,0)),
            #Get the count value for the year
            #count is a vector of the same value for each group
            count = max(count)) %>%
  ## Order by the age for the cumsum
  arrange(Age) %>%
  ## Get the cumulative total of bad loans over the course of the term
  mutate(total_bad = cumsum(bad)) %>%
  ## Group by the tranche of loans
  group_by(issue_year) %>%
  ##Get the total number of bad loans for the tranche
  mutate(max_total_bad = max(total_bad)) %>%
  ##Plot the cumulative percentage of chargeoffs for each tranche
  ggplot(aes(Age,total_bad/max_total_bad,color=factor(issue_year)))+
  geom_line()+
  theme_bw()+
  xlim(0,36)+
  ylab('Percent Total Chargeoffs')

chargeoffs

The cumulative charge off behavior has a characteristic shape, so we can use this shape to estimate the expected number of units making payments in a given month. We can generate this expectation by just doing a weighted average of these curves (though there are more thoughtful ways to do this as well). I’m going to take this average, apply a smoothing fit, and make a function out of it to include in the amortization of the loan.

percentChargeoffFunc &lt;- lcA %&gt;%
  #Perform the following calculations for each year
  group_by(issue_year) %&gt;%
  ## Count the number of leans each year
  mutate(count=n()) %&gt;%
  ## Perform a summary for each year and age of loan
  group_by(issue_year,Age) %&gt;%
  ## Find the number of loans that charge off each month
  summarise(bad = sum(ifelse(grepl(&quot;Charged Off&quot;,loan_status),1,0)),
            #Get the count value for the year
            #count is a vector of the save value for each group
            count = max(count)) %&gt;%
  ## ORder by the age for the cumsum
  arrange(Age) %&gt;%
  ## Limit to 36 months
  filter(Age &lt;= 36) %&gt;%
  ## Get the cumlative total of bad loans over the course of the term
  mutate(total_bad = cumsum(bad)) %&gt;%
  ## Group by the traunch of loans
  group_by(issue_year) %&gt;%
  ##Get the total number of bad loans for the traunch
  mutate(max_total_bad = max(total_bad)) %&gt;%
  ##Calculate percent bad
  mutate(percent_bad = total_bad/max_total_bad)  %&gt;%
  ## Group by Age to do the weighted average of all the curves
  group_by(Age) %&gt;%
  ## Calculate the weighted average by the traunch size
  summarise(avg_percent_total_bad = weighted.mean(percent_bad,count)) %&gt;%
  ## Return Approx Function to describe Chargeoff Behavior
  (function(df) {
    ## Make a loess fit of the average to smooth out
    mod &lt;- loess(avg_percent_total_bad ~ Age,df,span=1/4)
    ## make the predictions
    pred &lt;- predict(mod,df)
    ## create a data frame of the age and smoothed predictions
    tmp &lt;- data.frame(Age = df$Age,pred_percent_total_bad=pred)     ## make sure the new smooth values are between 0 and 1     tmp$pred_percent_total_bad = tmp$pred_percent_total_bad-min(tmp$pred_percent_total_bad)     tmp$pred_percent_total_bad = tmp$pred_percent_total_bad/max(tmp$pred_percent_total_bad)     # return approx function of the smooth function     approxfun(tmp$Age,tmp$pred_percent_total_bad,method='linear',yleft = 0,yright=1)   })  ## Plot Function Output data.frame(Age=seq(0,36),            pco = percentChargeoffFunc(seq(0,36))) %&gt;%
  ggplot(aes(Age,pco))+
  geom_line()+
  theme_bw()+
  xlim(0,36)+
  ylab('Percent Total Chargeoffs')

chargeoffs_smooth

Now that I have the function, I can include the expected charge off function times the actual scale to update our cash flow expectations.

#Amortization schedule for loans with charge off expectations
get_amortization_chargeoff <- function(term_number,interest_rate,percent_chargeoff){
  amortization <- data.frame(statement=seq(0,term_number,1),
                             payment=c(0,rep(payment_value(1,interest_rate,term_number),term_number)),
                             interest_payment = 0,
                             principal_payment=0,
                             start_balance=0,
                             end_balance=1,
                             #Added a unit chargeoff which is the new curve times the measured scale
                             unit_chargeoff = percent_chargeoff*percentChargeoffFunc(seq(0,36)))
  for(i in (1:term_number+1)){
    amortization[i,'start_balance'] = amortization[i-1,'end_balance']*(1+interest_rate/12)
    amortization[i,'interest_payment'] = interest_rate*amortization[i-1,'end_balance']/12
    amortization[i,'principal_payment'] = amortization[i,'payment'] - amortization[i,'interest_payment']
    amortization[i,'end_balance'] = amortization[i,'start_balance']-amortization[i,'payment']
  }
  amortization[term_number+1,'end_balance']=0
  #Adds a payment received which is the scheduled payment times the number of units left on a given statement
  amortization <- amortization %>% mutate(payment_recieved = (1-unit_chargeoff)*payment)
  return(amortization)
}

#Print output
get_amortization_chargeoff(36,0.07,0.06) %>%
  round_df(4) %>%
  kable( format = "markdown")
statement payment interest_payment principal_payment start_balance end_balance unit_chargeoff payment_recieved
0 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000
1 0.0309 0.0058 0.0250 1.0058 0.9750 0.0000 0.0309
2 0.0309 0.0057 0.0252 0.9806 0.9498 0.0008 0.0309
3 0.0309 0.0055 0.0253 0.9553 0.9244 0.0019 0.0308
4 0.0309 0.0054 0.0255 0.9298 0.8989 0.0033 0.0308
5 0.0309 0.0052 0.0256 0.9042 0.8733 0.0051 0.0307
6 0.0309 0.0051 0.0258 0.8784 0.8475 0.0070 0.0307
7 0.0309 0.0049 0.0259 0.8525 0.8216 0.0088 0.0306
8 0.0309 0.0048 0.0261 0.8264 0.7955 0.0107 0.0305
9 0.0309 0.0046 0.0262 0.8002 0.7693 0.0130 0.0305
10 0.0309 0.0045 0.0264 0.7738 0.7429 0.0152 0.0304
11 0.0309 0.0043 0.0265 0.7472 0.7163 0.0172 0.0303
12 0.0309 0.0042 0.0267 0.7205 0.6896 0.0191 0.0303
13 0.0309 0.0040 0.0269 0.6937 0.6628 0.0212 0.0302
14 0.0309 0.0039 0.0270 0.6667 0.6358 0.0235 0.0302
15 0.0309 0.0037 0.0272 0.6395 0.6086 0.0258 0.0301
16 0.0309 0.0036 0.0273 0.6122 0.5813 0.0282 0.0300
17 0.0309 0.0034 0.0275 0.5847 0.5538 0.0305 0.0299
18 0.0309 0.0032 0.0276 0.5570 0.5261 0.0327 0.0299
19 0.0309 0.0031 0.0278 0.5292 0.4983 0.0345 0.0298
20 0.0309 0.0029 0.0280 0.5012 0.4704 0.0363 0.0298
21 0.0309 0.0027 0.0281 0.4731 0.4422 0.0383 0.0297
22 0.0309 0.0026 0.0283 0.4448 0.4139 0.0399 0.0296
23 0.0309 0.0024 0.0285 0.4164 0.3855 0.0414 0.0296
24 0.0309 0.0022 0.0286 0.3877 0.3569 0.0429 0.0296
25 0.0309 0.0021 0.0288 0.3589 0.3281 0.0447 0.0295
26 0.0309 0.0019 0.0290 0.3300 0.2991 0.0469 0.0294
27 0.0309 0.0017 0.0291 0.3008 0.2700 0.0490 0.0294
28 0.0309 0.0016 0.0293 0.2715 0.2407 0.0510 0.0293
29 0.0309 0.0014 0.0295 0.2421 0.2112 0.0528 0.0292
30 0.0309 0.0012 0.0296 0.2124 0.1815 0.0543 0.0292
31 0.0309 0.0011 0.0298 0.1826 0.1517 0.0556 0.0292
32 0.0309 0.0009 0.0300 0.1526 0.1217 0.0567 0.0291
33 0.0309 0.0007 0.0302 0.1224 0.0916 0.0577 0.0291
34 0.0309 0.0005 0.0303 0.0921 0.0612 0.0586 0.0291
35 0.0309 0.0004 0.0305 0.0616 0.0307 0.0594 0.0290
36 0.0309 0.0002 0.0307 0.0309 0.0000 0.0600 0.0290

Results: Account For Charge Offs

lcA %&gt;%
  # Perform the following calculation for each year
  group_by(issue_year) %&gt;%
  # Aggregrate for each year
  summarise(
    #Number of Loans
    Count = n(),
    #Dollars Loaned Out
    DollarsFunded = sum(funded_amnt),
    #Dollars collected from payments
    DollarsRecieved = sum(total_rec),
    #Percent Charge Offs
    Chargeoffs = round(sum(ifelse(grepl(&quot;Charged Off&quot;,loan_status),1,0))/Count*100,1),
    #Percent Prepayments
    Prepays = round(sum(ifelse(grepl(&quot;Fully Paid&quot;,loan_status)&amp;(Age &lt; 36),1,0))/Count*100,1),     #The Percent Increase on the Dollars Loans     Yield = round(100*(DollarsRecieved/DollarsFunded-1),1),     #Dollar weighted interest rate for each year             Interest = round(weighted.mean(interest,funded_amnt),1)) %&gt;%
  mutate(
    #Applys the amortization schedule and sumes payments recieved for each loan
    ExpectedYield = apply( (.),
                              1,
                              function(x) sum(get_amortization_chargeoff(36,
                                                                         as.numeric(x['Interest'])/100,
                                                                         as.numeric(x['Chargeoffs'])/100
                              )$payment_recieved))) %&gt;%
  mutate(
    #Rounds Expected Yield to a percent
    ExpectedYield = round(100*(ExpectedYield-1),1)
    ) %&gt;%
  #Filter to relevant variables
  select(issue_year,DollarsFunded,Prepays,Chargeoffs,Interest,Yield,ExpectedYield) %&gt;%
  #Convert to % for display
  mutate(Prepays=paste0(Prepays,&quot;%&quot;),
         Chargeoffs=paste0(Chargeoffs,&quot;%&quot;),
         Interest=paste0(Interest,&quot;%&quot;),
         Yield=paste0(Yield,&quot;%&quot;),
         ExpectedYield = paste0(ExpectedYield,&quot;%&quot;)) %&gt;%
  #Create a data table
  kable( format = &quot;markdown&quot;)
issue_year DollarsFunded Prepays Chargeoffs Interest Yield ExpectedYield
2008 1899800 49.4% 6% 8.5% 7.6% 10%
2009 8700550 49.7% 6.7% 8.8% 7.6% 10.1%
2010 20650050 53.1% 4.4% 7.2% 7.4% 8.9%
2011 49784200 50.9% 6.4% 7.2% 6.4% 7.7%
2012 119545400 52.4% 7.2% 7.7% 6.4% 8%

Review: Account For Charge Offs

I see that the expected values have decreased from the previous expectations. The expected and actuals are closer in line, but there is room for improvement. I have accounted for the charge offs, but not the prepayment behavior.

Cash Flow Expectation : Account For Prepayments & Charge Offs

I can now repeat the process I did for including the effect of charge offs in our cash flow expectations. I can see if there is a characteristic cumulative behavior, generate a function that approximates the behavior, and incorporate it into the amortization.

# Prepays
lcA %>%
  ## Perform the following calculations for each year
  group_by(issue_year) %>%
  ## Count the number of loans each year
  mutate(count=n()) %>%
  ## Perform a summary for each year and age of loan
  group_by(issue_year,Age) %>%
  ## Find the number of loans that prepay each month
  summarise(prepay = sum(ifelse(grepl("Fully Paid",loan_status),1,0)),
            ## Get the count value for the year
            ## count is a vector of the same value for each group
            count = max(count)) %>%
  ## Order by the age for the cumsum
  arrange(Age) %>%
  ## Can only prepay before the loan is mature
  filter(Age < 36) %>%
  ## Get the cumulative total of prepaid loans over the course of the term
  mutate(total_prepay = cumsum(prepay)) %>%
  ## Group by the tranche of loans
  group_by(issue_year) %>%
  ## Get the total number of prepayments for the tranche
  mutate(max_total_prepay = max(total_prepay)) %>%
  ## Plot the cumulative percentage of prepaid loans for each tranche
  ggplot(aes(Age,total_prepay/max_total_prepay,color=factor(issue_year)))+
  geom_line()+
  theme_bw()+
  xlim(0,36)+
  ylab('Percent Total Prepayment')

prepays

The cumulative prepayment behavior over the life of the loan does have a characteristic shape, so I will generate an approximation function similar to the one for charge offs.

percentPrepaymentFunc &lt;- lcA %&gt;%
  ## Perform the following calculations for each year
  group_by(issue_year) %&gt;%
  ## Count the number of leans each year
  mutate(count=n()) %&gt;%
  ## Perform a summary for each year and age of loan
  group_by(issue_year,Age) %&gt;%
  ## Find the number of loans that prepayments each month
  summarise(prepay = sum(ifelse(grepl(&quot;Fully Paid&quot;,loan_status),1,0)),
            ## Get the count value for the year
            ## count is a vector of the save value for each group
            count = max(count)) %&gt;%
  ## ORder by the age for the cumsum
  arrange(Age) %&gt;%
  ## Can only prepay before the loan is mature
  filter(Age &lt; 36) %&gt;%
  ## Get the cumlative total of prepaid loans over the course of the term
  mutate(total_prepay = cumsum(prepay)) %&gt;%
  ## Group by the traunch of loans
  group_by(issue_year) %&gt;%
  ## Get the total number of prepayments for the traunch
  mutate(max_total_prepay = max(total_prepay))  %&gt;%
  ## Calculate the percentage of prepayments that happen in a given
  ## month out of all of the prepayments that happen in the traunch
  mutate(percent_prepay = total_prepay/max_total_prepay) %&gt;%
  ## Calculate for eeach age
  group_by(Age) %&gt;%
  ##  The weighted mean of the percentage by the size of the traunch
  summarise(avg_percent_total_prepay = weighted.mean(percent_prepay,count)) %&gt;%
  ## Return an approx function
  (function(df) {
    ## perform a loess fit to smoooth the function
    mod &lt;- loess(avg_percent_total_prepay ~ Age,df,span=1/4)
    ## predict the results
    pred &lt;- predict(mod,df)
    ## create a data frame to use to make the approx function
    tmp &lt;- data.frame(Age = df$Age,pred_percent_total_prepay=pred)     ## Make sure the fit produces values between 0 and 1     tmp$pred_percent_total_prepay = tmp$pred_percent_total_prepay-min(tmp$pred_percent_total_prepay)     tmp$pred_percent_total_prepay = tmp$pred_percent_total_prepay/max(tmp$pred_percent_total_prepay)     ## Return the approx function     approxfun(tmp$Age,tmp$pred_percent_total_prepay,method='linear',yleft = 0,yright=1)   })  ## Plot the approx functions data.frame(Age=seq(0,36),            ppp = percentPrepaymentFunc(seq(0,36))) %&gt;%
  ggplot(aes(Age,ppp))+
  geom_line()+
  theme_bw()+
  xlim(0,36)+
  ylab('Percent Total Chargeoffs')

prepays_smooth

Results: Account For Prepayments & Charge Offs

I have the characteristic prepayment function and charge off function, so I can append them to the amortization schedule. Like loans that charge off, a prepaid loan does not generate payments after it is prepaid. Unlike charged off loans, it does return the principal. Once this is included, I suspect I will see higher than previously expected cash flows earlier due to returned principal.

#Amortization schedule for loans with charge off and prepayment expectations
get_amortization_chargeoff_prepay <- function(term_number,interest_rate,percent_chargeoff,percent_prepay){
  amortization <- data.frame(statement=seq(0,term_number,1),
                             payment=c(0,rep(payment_value(1,interest_rate,term_number),term_number)),
                             interest_payment = 0,
                             principal_payment=0,
                             start_balance=0,
                             end_balance=1,
                             #Units charged off is measured scale * expected percent of total
                             unit_chargeoff = percent_chargeoff*percentChargeoffFunc(seq(0,36)),
                             #Units prepaid is measured scale * expected percent of total
                             unit_prepay = percent_prepay*percentPrepaymentFunc(seq(0,36)))
  for(i in (1:term_number+1)){
    amortization[i,'start_balance'] = amortization[i-1,'end_balance']*(1+interest_rate/12)
    amortization[i,'interest_payment'] = interest_rate*amortization[i-1,'end_balance']/12
    amortization[i,'principal_payment'] = amortization[i,'payment'] - amortization[i,'interest_payment']
    amortization[i,'end_balance'] = amortization[i,'start_balance']-amortization[i,'payment']
  }
  amortization[term_number+1,'end_balance']=0
  amortization <- amortization %>%
    mutate(monthly_prepay = ifelse(is.na(lag(unit_prepay)),0,unit_prepay-lag(unit_prepay))) %>%
    mutate(payment_recieved = (1-unit_chargeoff-unit_prepay)*payment+monthly_prepay*(end_balance+payment))
  return(amortization)
}

get_amortization_chargeoff_prepay(36,0.07,0.06,0.50) %>%
  round_df(4) %>%
  kable( format = "markdown")
statement payment interest_payment principal_payment start_balance end_balance unit_chargeoff unit_prepay monthly_prepay payment_recieved
0 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000
1 0.0309 0.0058 0.0250 1.0058 0.9750 0.0000 0.0000 0.0000 0.0309
2 0.0309 0.0057 0.0252 0.9806 0.9498 0.0008 0.0064 0.0064 0.0369
3 0.0309 0.0055 0.0253 0.9553 0.9244 0.0019 0.0135 0.0071 0.0372
4 0.0309 0.0054 0.0255 0.9298 0.8989 0.0033 0.0214 0.0079 0.0375
5 0.0309 0.0052 0.0256 0.9042 0.8733 0.0051 0.0298 0.0084 0.0374
6 0.0309 0.0051 0.0258 0.8784 0.8475 0.0070 0.0390 0.0093 0.0376
7 0.0309 0.0049 0.0259 0.8525 0.8216 0.0088 0.0492 0.0101 0.0377
8 0.0309 0.0048 0.0261 0.8264 0.7955 0.0107 0.0600 0.0108 0.0376
9 0.0309 0.0046 0.0262 0.8002 0.7693 0.0130 0.0718 0.0118 0.0377
10 0.0309 0.0045 0.0264 0.7738 0.7429 0.0152 0.0825 0.0107 0.0361
11 0.0309 0.0043 0.0265 0.7472 0.7163 0.0172 0.0928 0.0103 0.0352
12 0.0309 0.0042 0.0267 0.7205 0.6896 0.0191 0.1039 0.0111 0.0351
13 0.0309 0.0040 0.0269 0.6937 0.6628 0.0212 0.1168 0.0129 0.0356
14 0.0309 0.0039 0.0270 0.6667 0.6358 0.0235 0.1321 0.0153 0.0363
15 0.0309 0.0037 0.0272 0.6395 0.6086 0.0258 0.1474 0.0153 0.0353
16 0.0309 0.0036 0.0273 0.6122 0.5813 0.0282 0.1631 0.0157 0.0346
17 0.0309 0.0034 0.0275 0.5847 0.5538 0.0305 0.1788 0.0157 0.0336
18 0.0309 0.0032 0.0276 0.5570 0.5261 0.0327 0.1942 0.0154 0.0324
19 0.0309 0.0031 0.0278 0.5292 0.4983 0.0345 0.2096 0.0154 0.0315
20 0.0309 0.0029 0.0280 0.5012 0.4704 0.0363 0.2256 0.0160 0.0308
21 0.0309 0.0027 0.0281 0.4731 0.4422 0.0383 0.2432 0.0176 0.0305
22 0.0309 0.0026 0.0283 0.4448 0.4139 0.0399 0.2599 0.0167 0.0290
23 0.0309 0.0024 0.0285 0.4164 0.3855 0.0414 0.2759 0.0160 0.0278
24 0.0309 0.0022 0.0286 0.3877 0.3569 0.0429 0.2923 0.0163 0.0269
25 0.0309 0.0021 0.0288 0.3589 0.3281 0.0447 0.3098 0.0176 0.0262
26 0.0309 0.0019 0.0290 0.3300 0.2991 0.0469 0.3292 0.0194 0.0257
27 0.0309 0.0017 0.0291 0.3008 0.2700 0.0490 0.3486 0.0193 0.0244
28 0.0309 0.0016 0.0293 0.2715 0.2407 0.0510 0.3681 0.0196 0.0232
29 0.0309 0.0014 0.0295 0.2421 0.2112 0.0528 0.3872 0.0191 0.0219
30 0.0309 0.0012 0.0296 0.2124 0.1815 0.0543 0.4062 0.0189 0.0207
31 0.0309 0.0011 0.0298 0.1826 0.1517 0.0556 0.4249 0.0187 0.0195
32 0.0309 0.0009 0.0300 0.1526 0.1217 0.0567 0.4438 0.0190 0.0183
33 0.0309 0.0007 0.0302 0.1224 0.0916 0.0577 0.4632 0.0194 0.0172
34 0.0309 0.0005 0.0303 0.0921 0.0612 0.0586 0.4819 0.0187 0.0159
35 0.0309 0.0004 0.0305 0.0616 0.0307 0.0594 0.5000 0.0181 0.0147
36 0.0309 0.0002 0.0307 0.0309 0.0000 0.0600 0.5000 0.0000 0.0136

Now I can add the prepayment behavior into our cashflows and compare the expected to the actuals.

lcA %&gt;%
  # Perform the following calculation for each year
  group_by(issue_year) %&gt;%
  # Aggregrate for each year
  summarise(
    #Number of Loans
    Count = n(),
    #Dollars Loaned Out
    DollarsFunded = sum(funded_amnt),
    #Dollars collected from payments
    DollarsRecieved = sum(total_rec),
    #Percent Charge Offs
    Chargeoffs = round(sum(ifelse(grepl(&quot;Charged Off&quot;,loan_status),1,0))/Count*100,1),
    #Percent Prepayments
    Prepays = round(sum(ifelse(grepl(&quot;Fully Paid&quot;,loan_status)&amp;(Age &lt; 36),1,0))/Count*100,1),     #The Percent Increase on the Dollars Loans     Yield = round(100*(DollarsRecieved/DollarsFunded-1),1),     #Dollar weighted interest rate for each year             Interest = round(weighted.mean(interest,funded_amnt),1)) %&gt;%
  mutate(
     #Applys the amortization schedule and sumes payments recieved for each loan
    ExpectedYield = apply( (.),
                              1,
                              function(x) sum(get_amortization_chargeoff_prepay(36,
                                                                         as.numeric(x['Interest'])/100,
                                                                         as.numeric(x['Chargeoffs'])/100,
                                                                         as.numeric(x['Prepays'])/100
                              )$payment_recieved))) %&gt;%
  mutate(
    #Rounds Expected Yield to a percent
    ExpectedYield = round(100*(ExpectedYield-1),1)
    ) %&gt;%
  #Filter to relevant variables
  select(issue_year,DollarsFunded,Prepays,Chargeoffs,Interest,Yield,ExpectedYield) %&gt;%
  #Convert to % for display
  mutate(Prepays=paste0(Prepays,&quot;%&quot;),
         Chargeoffs=paste0(Chargeoffs,&quot;%&quot;),
         Interest=paste0(Interest,&quot;%&quot;),
         Yield=paste0(Yield,&quot;%&quot;),
         ExpectedYield = paste0(ExpectedYield,&quot;%&quot;)) %&gt;%
  #Create a data table
  kable(format = &quot;markdown&quot;)
issue_year DollarsFunded Prepays Chargeoffs Interest Yield ExpectedYield
2008 1899800 49.4% 6% 8.5% 7.6% 8.4%
2009 8700550 49.7% 6.7% 8.8% 7.6% 8.3%
2010 20650050 53.1% 4.4% 7.2% 7.4% 7.4%
2011 49784200 50.9% 6.4% 7.2% 6.4% 6.3%
2012 119545400 52.4% 7.2% 7.7% 6.4% 6.4%

Review: Account For Prepayments & Charge Offs

Now I have a framework for cash flow expectations that comes within error of the actuals. This method requires that I generate expectations for the charge off and prepayment scales to forecast cash flows, but once I have those, the rest is just straightforward math. The cash flow model that best describes the Lending Club actuals starts with the paid as agreed model, then accounts for charge off and prepayment behavior. Even though there are a number of simplifications and assumptions in this framework, it gives an improved understanding of the expected cash flows for a grade A, 36 month term loan in Lending Club.

‘Real-Time’ Online Machine Learning with Apache Kafka and Sklearn [plus other buzz words]

Screen Shot 2016-06-12 at 2.31.17 PM.png

I have been obsessed with online machine learning since I first read Richard Sutton’s  Learning to Predict by the Methods of Temporal Differences.  The idea of being able to update a model with each new bit of information felt closer to real cognition than traditional supervised methods.

I have also been reading more and more about real-time analytics and their facilitation by distributed messaging systems like Apache Kafka.   Here is a list of companies that are using Kafka.

I wanted to spend today getting Kafka running locally, publishing and consuming using python, and running an online model on data consumed from Kafka that was published from a python simulation.

My work flow for today is:

  1. Download, Configure, and Run Kafka (Mac/Linux)
    1. Start Zookeeper
    2. Start two servers
    3. Create Kafka Topics
  2. Write a Simple Python Producer/Consumer
  3. Produce Some Simulation Data
  4. Fit a model

Setting up Kafka

I downloaded the binary file from http://kafka.apache.org/downloads.html.  I used the Scala 2.11 build, but I do not believe it matters unless you plan on using Scala to interface with Kafka.   If and when I do this with Spark, I will give more detailed instructions.  Since we’re doing this with python we can use either.

After I download it, I decompress the archive in my home directory and move into the directory.

Screen Shot 2016-06-12 at 1.21.25 PM.png

We are going to be working with the bin and config directories.   First we need to set up configs so I can have two servers.  Because, why not.  We want to edit the first server.properties file and create a log directory.

Screen Shot 2016-06-12 at 1.38.49 PM.png

I have run ‘mkdir kafka-log-1’ in my home directory.

Now we want to make a copy of the server.properties file and then edit the copy.

Screen Shot 2016-06-12 at 1.26.48 PM.png

We are changing the broker id, and the port number

Screen Shot 2016-06-12 at 1.25.16 PM.png

Screen Shot 2016-06-12 at 1.41.50 PM.png

I have also run ‘mkdir kafka-log-2’ in my home directory.

Let’s start Zookeeper and Kafka.

./bin/zookeeper-server-start.sh config/zookeeper.properties &
./bin/kafka-server-start.sh config/server.properties &
./bin/kafka-server-start.sh config/server2.properties &

These three processes are started in the background, but you will see a lot of log messages and warnings displayed.  I’m just going to ignore them and move on since they are not blockers.

Creating Topics

Now that Kafka is running locally, we are going to create some topics.  One will be for our simple-test.  The other for our online-learning.

&amp;amp;amp;nbsp;./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic simple-test --partitions 2 --replication-factor 2
&amp;amp;amp;nbsp;./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic online --partitions 2 --replication-factor 2

 

After you have created your topics, you can make sure you can describe them

Screen Shot 2016-06-12 at 1.48.07 PM.png

Basic Python Producer and Consumer

There are two python Kafka libraries you can choose from:

  1. kafka-python
  2. pykafka

I am not an expert, and am using kafka-python.  It is easily installed using pip.

pip install kafka-python

I want to start with a simple consumer.  It is going to iterate through all messages in the topic ‘simple-test’ and print them out.  It will wait until I kill the process.  FYI: kafka-python requires that I point it to the Kafka server addresses:

#simple_consumer.py
from kafka import KafkaConsumer

consumer = KafkaConsumer('simple-test',
	group_id='test',
	bootstrap_servers="localhost:9091,localhost:9092")

for msg in consumer:
	print msg.value

You can run this at the command line:

python simple_consumer.py

Now I want to make a simple producer that adds a message every second to the topic ‘simple-test’.

#simple_producer.py
from kafka import KafkaProducer
from time import sleep

producer = KafkaProducer(
	bootstrap_servers='localhost:9091,localhost:9092',
	client_id='simple-producer')

for i in range(120):
	msg = &quot;Message {}&quot;.format(i)
	producer.send('simple-test',msg)
	sleep(1)

You can run this at the command line in a separate terminal:

python simple_producer.py

When both are running you should see something like this.

Screen Shot 2016-06-12 at 2.31.17 PM.png

You will notice that order is not ensured!

Online-Learning

I am going to run sklearn’s SGD classifier using its partial_fit functionality.   For simplicity, I am going to use this example of SGD but split the data generation and fitting across Kafka, using partial_fit instead of fit.

I am going to transform all the data into JSON for transmission through Kafka, and then the consumer will have to parse the JSON to update the model.

#data_producer.py
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
import json
from kafka import KafkaProducer
from time import sleep

producer = KafkaProducer(
	bootstrap_servers='localhost:9091,localhost:9092',
	client_id='online-producer')

X, Y = make_blobs(n_samples=1200,n_features=2,centers=2, cluster_std=3.0)
temp = np.zeros((X.shape[0],X.shape[1]+1))
temp[:,:X.shape[1]]=X
temp[:,X.shape[1]]=Y
for row in temp:
    post = json.dumps(dict([(i,x) for i,x in enumerate(row.tolist())]))
    producer.send('online',post)
    sleep(0.1)

You can start this script with:

python data_producer.py

For the consumer, we will be updating a model and just printing out its running accuracy.   In theory we can do a lot more, but I’m just playing.  Best to keep it simple.

#online_consumer.py
import numpy as np
from sklearn.linear_model import SGDClassifier
from kafka import KafkaConsumer
import json
from sklearn.metrics import roc_auc_score

n_features = 2

clf = SGDClassifier(loss=&quot;hinge&quot;, alpha=0.01, n_iter=20, fit_intercept=True)

consumer = KafkaConsumer('online',
						  group_id='online-consumer',
					      bootstrap_servers=&quot;localhost:9091,localhost:9092&quot;)

count = 0
correct = 0

for msg in consumer:
	data = json.loads(msg.value)
	vec = np.zeros(len(data))
	for key,value in data.iteritems():
		vec[int(key)]=value
	features = vec[:n_features].reshape(1,n_features)
	target = [vec[n_features]]
	if count &gt; 1:
		prediction = clf.predict(features)
		correct += int(prediction[0]==int(target[0]))
	if count &gt; 2:
		print correct/float(count-1)
	clf.partial_fit(features, target,classes=[0,1])
	count += 1
	

 

You can start the consumer in one terminal:

python online_consumer.py

And the producer in the other:

python data_producer.py

Screen Shot 2016-06-12 at 3.12.21 PM.png

Clean Up

You need to kill all your processes.   I usually ctrl-c the python scripts.

To stop Kafka and Zookeeper you can run the shutdown scripts:

./bin/kafka-server-stop.sh
./bin/zookeeper-server-stop.sh

 

Thank You

I just wanted to play this morning, and decided to make it a blog post.  If this is new to you, I hope you learned something.  If you are a master wizard,  I would love to get any advice on doing this better.  Either way, thanks for reading this far.

 

Launching Spark and Zeppelin with Amazon’s Elastic Map Reduce

I have been dealing with projects that are using datasets that don’t fit in memory, or even on my laptop.  I have also been taking a class where a number of classmates reached out for help setting up a ‘Big Data’ system on AWS.   The class is having us use Pig, Hive, & Spark to analyze healthcare records and apply machine learning and clustering methods to the data.

I found myself repeating the setup process, so I decided to write a post outlining how I have been setting up a system that has Hive, Pig, Spark, and Zeppelin installed on AWS.

Zeppelin is a relatively new notebook that works with scala, sql, and more!

Step 0 – Sign up For AWS

I feel like this goes without saying, but experience taught me otherwise.   https://aws.amazon.com/

Step 1 – Launch EMR

The first step is to log into your AWS console.  Once in, you can click on the EMR icon/link.

Slide01.jpg

That will bring you to a page where you can create your own new and shiny cluster.

Slide02.jpg

To install all of the software we want on our cluster we are going to have to go to the advanced options.

Slide03.jpg

AWS now supports all the standard big data software out of the box.  Just check what you want installed.  In this case I make sure I have Hadoop 2.7, Spark 1.6, Hive 1.0, Pig 0.14, and Zeppelin 0.5.5.

Slide04.jpg

Click next and you will be taken to the Hardware tab.  I always add storage to my EMR clusters.   The rule of thumb I use is to triple the size of the data I am going to load into the cluster.   I also make sure I have enough space on the master node to store my data before I load it into the cluster.

Slide05.jpg

I also choose my instance types at this point.   For this cluster I am choosing r3.xlarge because of the increased memory.  Each node is currently $0.09/hour, so this cluster I am making is $0.45/hour.   A relatively modest cost to analyze 100GB of data.  You can find the EMR pricing here.

Slide06.jpg

You can click next and give your cluster an identity.  This is also a spot where you can choose bootstrap actions that will allow you to include additional software.  I will skip this for this post.

Slide07.jpg

Now it’s time to choose your keypair.  If you have not done this before, this is where you can run into some issues.  Read this link completely: https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-access-ssh.html

Slide08.jpg

Now that you are done with all the steps you create the cluster and wait!

Step 2 – Edit Security Groups

While you are waiting for the cluster to load you are going to want to make sure you have access to the master node.   You are going to need to ssh in (port 22) and access Zeppelin (port 8890).   From the cluster details window you can click on "security group for Master".

Slide09.jpg

You will want to change the inbound settings so you can contact the server from your IP address.

Slide10.jpg

Click Edit:

Slide11.jpg

Add ssh or custom TCP rules if they do not exist.   I recommend you choose "My IP" to limit access to your location.   Bots scan AWS ports and can/will connect to open connections.   Open up your 22 and 8890 ports.

Slide12.jpg

Now wait until your cluster is ready.

Slide13.jpg

Step 3 – SSH and Load Data

Once your cluster is ready and waiting for you, you can ssh into your master node to load your data into the hadoop file system.   I keep my data in private s3 buckets, so I will be taking advantage of the AWS Command Line Tool that comes installed with EMR.

This is where you will need your keypair to ssh into your master node.   I will show the mac/linux method.  Windows users will have to look up the Putty instructions.

FIRST TIME USERS:   be sure to chmod 400 <keypair>.pem before you try to connect to your master node.


ssh -i &amp;amp;lt;keypair&amp;amp;gt;.pem hadoop@&amp;amp;lt;master-node-address&amp;amp;gt;

Slide14.jpg

Now that you are connected to the master node you can copy data into hdfs.  I am first going to copy my class data from an S3 bucket to the local master node.

Slide15.jpg

I want to access the data from Zeppelin, and EMR sets up a Zeppelin user.  To access the data from the Zeppelin notebook you have to load it into the right HDFS directory.


hdfs dfs -put &amp;amp;lt;data&amp;amp;gt; /user/zeppelin

Slide17.jpg

The data is now loaded into your hadoop file system in the cluster.   It is also accessible from zeppelin.

Step 4 – Use Zeppelin

Now that you have your data loaded, you can take advantage of Spark and Zeppelin.  Open a browser and type in the web address of your master node, but connect on port 8890.  Create a new notebook.

Slide18.jpg

There is a spark context automatically loaded.  You can type ‘sc’ to see it.   You can also load your data that is in hdfs at the /user/zeppelin location.  What is nice about Zeppelin is that you have shell commands, markdown, scala, and python at your fingertips.

Slide20.jpg

Step 5 – Save your Notebooks

Data is not saved when the EMR cluster terminates, so you need to save your notebooks out before you terminate your cluster.   I always make a new s3 bucket and then copy the local notebooks to the s3 bucket.


aws s3 mb s3://zeppelin-notebooks

aws s3 sync /var/lib/zeppelin s3://zeppelin-notebooks

Slide21.jpg

Step 6 – Terminate your Cluster

Now that you have done your big data analysis with your fancy new cluster, you need to terminate your cluster to avoid accruing costs.   From your cluster dashboard you can hit terminate.

Slide22.jpg

I always have termination protection on, so I have to turn it off before I terminate.

Slide23.jpg

After a few minutes the cluster will be terminated and you’re done.

Slide25.jpg

Thank You

I appreciate you taking the time to get this far into my post.   Let me know if it was helpful.

 

Lazy Evaluation of Stochastic Gradient Descent for Logistic Regression With Sparse Data

I am currently taking Big Data For Health Informatics for OMSCS.   We have been working with data summarization in hive, feature engineering with pig, and mapping and reducing model fitting with hadoop.   In the last homework assignment we were referred to a paper titled Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression, on which this post is based.

While I was reading it I had one of those forehead-slap, "No duh! I’ve been an idiot." moments that forecast improvement in both execution and understanding when working with sparse datasets.   Still very much a noob in data science!

In this post I wanted to just show the update rules for SGD that take advantage of sparse data structures with regularization.

Weight Update Functions

I have gone through the derivation of the logistic cost function and weight updates with Ridge and LASSO regularization at length in previous posts, so I will just post the update rules here:

Ridge Weight Update

w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i - \alpha \lambda w^i

LASSO Weight Update

w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i - \alpha \lambda sign(w^i)

The w^i is the weight for the ith feature, p( y_k = 1 | X,w) is the probability that the example is a positive example, \alpha is the learning rate for the update, and \lambda is the regularization coefficient.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) updates the weights as each data point comes in, instead of in an iterative batch fashion.   The summation in the above equations goes away, and each data point individually updates the weights.  The new update rules follow.

Ridge SGD Weight Update

w^i_t = w^i_t \left( 1 - \alpha \lambda \right) + \alpha \left( y_t - p(y_t=1 | X_t, w_t) \right) X^i_t

LASSO SGD Weight Update

w^i_t = w^i_t - \alpha \lambda sign(w^i_t) + \alpha \left( y_t - p(y_t=1 | X_t, w_t) \right) X^i_t

I have rearranged the regularization term toward the front of the equation to highlight what happens when the data set is sparse.  When a particular feature is zero for a run of data points, only the regularization term touches its weight, steadily shrinking it.

Ridge SGD Sparse Weight Update

While a particular feature is zero, its weight should decay toward zero with each new observation.   During this stretch the updates look like the following.

w^i_t = w^i_{t-1} \left( 1 - \alpha \lambda \right)

w^i_t = w^i_{t-2} \left( 1 - \alpha \lambda \right)^2

w^i_t = w^i_{t-3} \left( 1 - \alpha \lambda \right)^3

w^i_t = w^i_{t-n} \left( 1 - \alpha \lambda \right)^n

If we have 300,000 features, we would need to perform 300,000 decays each round, even if only 10 or so features are non-zero.   The above version highlights that we can lazily update a weight only when we see its feature is nonzero, as long as we keep track of how long it has been since it was last updated.

LASSO SGD Sparse Weight Update

Similarly for LASSO regression we see the following update.

w^i_t = w^i_{t-1} - sign(w^i_{t-1}) \alpha \lambda

w^i_t= w^i_{t-2} - 2 sign( {w^i_{t-2}} ) {\alpha \lambda}

w^i_t=w^i_{t-3} - 3 sign( {w^i_{t-3}} ) {\alpha \lambda}

w^i_t=w^i_{t-n} - n sign( {w^i_{t-n}} ) {\alpha \lambda}

w^i_t = w^i_{t-n} \left( 1 - \frac{ n\alpha \lambda }{|w^i_{t-n}|} \right)

If n \lambda \alpha > |w^i_{t-n}| then the weight is set to zero.   This is because a weight should not change sign due to regularization alone; this is the clipping referred to in the paper.
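To make the clipping concrete, here is a small sketch (my own, not taken from the paper) of the lazy LASSO decay for a single weight whose feature has been zero for n consecutive updates:

def lazy_lasso_decay(w, n, alpha, lam):
    # Apply n skipped LASSO regularization steps to one weight,
    # clipping at zero if the accumulated penalty would flip its sign.
    penalty = n * alpha * lam
    if penalty >= abs(w):
        return 0.0                      # clipping: regularization never changes the sign
    return w - penalty * (1 if w > 0 else -1)

# A weight of 0.05 skipped for 20 rounds with alpha*lambda = 0.004 is clipped to zero,
# while 5 skipped rounds just shrink it.
print(lazy_lasso_decay(0.05, 20, 0.1, 0.04))   # 0.0
print(lazy_lasso_decay(0.05, 5, 0.1, 0.04))    # 0.03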

SVM Light Format

To take advantage of the sparseness in the dataset we need to represent the data in a sparse format.  The format of the data is

<line> .=. <target> <feature>:<value> … <feature>:<value> # <info>

An example of a line from a file is:

1 10:1.000000 39:1.000000 431:0.960630 951:1.000000 981:0.296029 989:0.721311 1001:0.400000

This row is a positive training example (leading 1) with the 10th feature having a non-zero value of 1.0, the 39th feature having a non-zero value of 1.0, the 431st feature having a value of 0.96, and so on.

This allows a significant compression of data when the data is sparse.  This is a common format for text data, and you can download the Dexter Dataset as an example.
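For illustration, here is a minimal parser (my own sketch) that turns the example line above into a label and a list of (feature, value) pairs, which is the same representation the classes below consume:

def parse_svmlight_line(line):
    # Parse one SVMLight-style line into (label, [(feature, value), ...])
    parts = line.strip().split()
    label = int(parts[0])
    features = [(int(f), float(v)) for f, v in (p.split(":") for p in parts[1:])]
    return label, features

line = "1 10:1.000000 39:1.000000 431:0.960630 951:1.000000"
print(parse_svmlight_line(line))
# (1, [(10, 1.0), (39, 1.0), (431, 0.96063), (951, 1.0)])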

Code

I am going to compare two versions of the same LogisticRegressionSGD class: one with lazy evaluation and one without.  The code is very similar between the two classes.  I am only going to evaluate ridge regression in this post; the LASSO extension is equally straightforward.

Lazy Evaluation

import collections
import math

class LogisticRegressionSGD:

    def __init__(self, learning_rate, l2_coef):
        """
        Initialization of model parameters
        """
        self.learning_rate = learning_rate
        self.weight = collections.defaultdict(float)
        self.l2_coef = l2_coef
        self.last_update = collections.defaultdict(int)
        self.count = 1

    def fit(self, X, y):
        """
        Update model using a pair of training sample
        """
        if self.l2_coef > 0:
            # Lazily apply all the regularization decays this feature has
            # missed since it was last seen, then mark it as up to date.
            for f, v in X:
                self.weight[f] *= (1-self.l2_coef*self.learning_rate)**(self.count-self.last_update[f])
                self.last_update[f] = self.count

        diff = y - self.predict_prob(X)

        for f, v in X:
            self.weight[f] += diff*v*self.learning_rate
        self.count += 1

    def predict(self, X):
        return 1 if self.predict_prob(X) > 0.5 else 0

    def predict_prob(self, X):
        return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))

Standard Evaluation


import collections
import math

class LogisticRegressionSGDSlow:

    def __init__(self, learning_rate, l2_norm):
        """
        Initialization of model parameters
        """
        self.learning_rate = learning_rate
        self.weight = collections.defaultdict(float)
        self.l2_norm = l2_norm
        self.last_update = collections.defaultdict(int)  # not used in this version

    def fit(self, X, y):
        """
        Update model using a pair of training sample
        """

        diff = y - self.predict_prob(X)

        if self.l2_norm > 0:
            # Decay every weight in the model on every example,
            # even the ones whose features are zero in this row.
            for f in self.weight:
                self.weight[f] *= (1-self.l2_norm*self.learning_rate)

        for f, v in X:
            self.weight[f] += diff*v*self.learning_rate

    def predict(self, X):
        return 1 if self.predict_prob(X) > 0.5 else 0

    def predict_prob(self, X):
        return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))

The first class keeps track of how many examples it has been fed, and uses an additional dictionary to track when each weight was last updated.   The second class iterates through every weight in the model for each data point and decays it.
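For a quick feel of the interface (my own toy example, assuming the lazy class above is defined in the same session), each row is just a list of (feature, value) pairs plus a label:

rows = [
    ([(1, 1.0), (7, 0.5)], 1),
    ([(2, 1.0)], 0),
    ([(1, 0.3), (9, 1.0)], 1),
    ([(7, 1.0)], 0),
]

model = LogisticRegressionSGD(learning_rate=0.1, l2_coef=0.1)
for X, y in rows:
    model.fit(X, y)

print(model.predict_prob([(1, 1.0)]))   # probability the example is positive
print(model.predict([(2, 1.0)]))        # hard 0/1 prediction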

Command Line Training

To emphasize the streaming nature of SGD training, I am going to use command-line piping to load the data into the model. The Python script below reads stdin, parses the SVMLight data, and trains the SGD logistic model.

#!/usr/bin/env python

import sys
import random
import pickle
from optparse import OptionParser
from LogisticRegressionSGDSlow import LogisticRegressionSGDSlow

to_float_tuple = lambda x: (int(x[0]), float(x[1]))

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option("-l", "--learning_rate", action="store",
                      dest="learning_rate", default=0.01, help="step size", type="float")
    parser.add_option("-c", "--l2_norm", action="store",
                      dest="l2_norm", default=0.0, help="regularization strength", type="float")
    parser.add_option("-m", "--model-path", action="store", dest="path",
                      default="logSGD.model", help="path where trained classifier was saved")

    options, args = parser.parse_args(sys.argv)

    model = LogisticRegressionSGDSlow(options.learning_rate, options.l2_norm)

    for line in sys.stdin:
        row = line.split(" ", 1)
        y = int(row[0])
        features = row[1]
        X = [to_float_tuple(pair.split(":")) for pair in features.split()]
        model.fit(X, y)

    with open(options.path, "wb") as f:
        pickle.dump(model, f)

Training the model on sparse data with 25k rows and 300k features looks like

time cat data/training.data | python train_data_slow.py -c 0.1 -l 0.1

The time output is:

real      0m27.605s

user     0m27.442s

sys       0m0.070s

This took about 30 seconds to train at the command line.   Using the same pipeline but with the lazy-evaluation class gives a significant speed increase.

time cat data/training.data | python train_data_lazy.py -c 0.1 -l 0.1

real      0m0.959s

user     0m0.844s

sys       0m0.044s

The speed increase in this case is about 30x.   The results of the two models are very similar, though they will never be exactly the same.   This is because the updates are applied in a slightly different order, so the final weights differ.

Final Thoughts

In this post I wanted to review a paper from my class that helped me understand how taking advantage of a sparse representation of the data can lead to a performance increase when training a model.   Sklearn in Python and glmnet in R both take advantage of this representation, so there is no need to do it yourself.   I just like, out of habit of mind, to deconstruct these ideas when I have the chance.

Thanks for Reading.

As always, I hope you gained something from reading my musings.   Feel free to make comments, or let me know any corrections that need to be made.  I have found tons already, but have not caught them all.

Bias-Variance Tradeoff in Linear Regression

Summary

This is a math-heavy post where I use the expectation of the cost function for linear regression to discuss the bias-variance tradeoff in model predictions.   First I give an example of the tradeoff, then I jump into the math to derive it.

Linear Regression

Linear regression is a supervised learning technique with the goal of predicting a continuous target value based on a set of features. One might wish to predict the price a home will sell at, and the features could be the quality and description of the home, as well as the location and various characteristics of the neighborhood. In this example, even if we had the perfect model to predict the price of the home, other factors will come into play to alter the result: the motivation of the seller and buyer, the people who happen to see the home and their negotiation temperament, and other psychological effects. You could think of the price a home will sell at as a rational price plus an irrational change due to the buying/selling process.

\Large{ \textrm{Selling Price} = \textrm{True Price} + \textrm{Noise} }

Or more abstractly:

\Large{ y_{Obs} = y_{Truth} + \epsilon }

The truth depends on a combination of weights and features. We can write it as the following function

\Large{ y_{Truth}\left(x,w\right) = w^0 + \sum_{i=1}^{M} w^i \cdot x_i }

\Large{ w^i = \textrm{Weight for ith Feature} }

\Large{ x_i = \textrm{ith Feature} }

where our data has M features, and we allow a non-zero intercept w^0.

The larger M is, the more complex our model is, and the more strongly it will fit whatever data is used to train it.  If the data is not representative of the population it is supposed to represent, the resulting model will be overfit.  It will also vary a lot from dataset to dataset.   The fewer the features, the less the model will overfit, but the more biased the resulting model will be.

I can demonstrate this with a few lines of code in Python.  I am going to generate some linear data with noise, then fit the data with a ridge linear regression.  The alpha parameter limits the model complexity: the larger alpha is, the less complex the model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
%matplotlib inline
from sklearn.linear_model import Ridge

#Fit less complex to more complex models
for alpha in [100,10,1]:
    lin = Ridge(alpha=alpha)
    xvals = np.linspace(0,10,101)
    ytrue = (2.5*xvals+10).reshape(101,1)
    ys = []
    plt.figure(figsize=(16,8))
    #Generate 50 Datasets, fit them, and plot them.
    for i in range(50):
        x = np.random.randn(10,1)*2+5
        y = x*2.5 + 10 + np.random.randn(10,1)*10
        lin.fit(x,y)
        yp = lin.predict(xvals.reshape(xvals.shape[0],1))
        ys.append(yp)
        plt.plot(xvals,yp,'b--',alpha=0.1)

    ydata = ys[0].copy()
    for y in ys[1:]:
        ydata += y
    ydata = ydata/50
    plt.title('Mean Var:{0:.3f}      Bias:{1:.3f}'.format(np.array([np.var(y-ydata,axis=0) for y in ys]).mean(),np.var(ytrue-ydata)))
    plt.plot(xvals,ydata,'b-')
    plt.plot(xvals,2.5*xvals+10,'g-')
    plt.ylim(0,50)
    plt.show()

Screen Shot 2016-01-31 at 1.06.03 PM

The light dashed blue lines are the result of each of the 50 fits for each model.  The solid blue line is the average result from all the data sets.  The green line is the underlying model, or ground truth.

From top to bottom we see the least complex model has the least variation from dataset to dataset, but its average result is biased.  The middle model has more variation from dataset to dataset, and is less biased.   The bottom model has the most variation from dataset to dataset, but is the least biased.

This also illustrates the power of ensemble methods when the underlying model is not biased and the data is sampled in a way that represents the population distribution.

Linear Regression Theory

As mentioned above, Linear regression takes the form:

\Large{ y_{Obs} = y_{Truth}(x,w)+ \epsilon }

It is often assumed that the noise is Gaussian (though it does not have to be). The probability density of getting a particular value can be written as:

\Large{ p\left(y_{Obs} | x,w,\sigma_{\textrm{Noise}}\right) = \frac{1}{\sqrt{2 \pi \sigma^2_{\textrm{Noise}}}} exp \left[ \frac{-\left( y_{Obs} - y_{truth}\left(x,w\right) \right)^2}{2 \sigma^2_{\textrm{Noise}}} \right] }

Since the noise is assumed independent, the probability of getting a set of N observations is just the product of the individual probabilities.

\Large{ p\left(\bf{y_{Obs}}\right) = \prod_{n=1}^{N} p\left(y^n_{Obs} | x^n,w,\sigma_{\textrm{Noise}}\right) }

This is called the likelihood function. Linear regression finds the weights that maximize this likelihood for a given set of observations.

For a host of reasons, it is common to take the natural log of the likelihood, generating the log-likelihood.

\Large{ ln \left[ p\left(\bf{y_{Obs}}\right) \right] = \sum_{n=1}^{N} ln\left[p\left(y^n_{Obs} | x^n,w,\sigma_{\textrm{Noise}}\right) \right] }

\Large{ ln \left[ p\left(\bf{y_{Obs}}\right) \right] = \sum_{n=1}^{N} \frac{-1}{2} ln\left[2 \pi \sigma_{\textrm{Noise}}^2\right] + \sum_{n=1}^{N} \frac{-\left( y^n_{Obs} - y_{truth}\left(x^n,w\right) \right)^2}{2 \sigma^2_{\textrm{Noise}}} }

\Large{ ln \left[ p\left(\bf{y_{Obs}}\right) \right] = -\frac{N}{2} ln\left[2 \pi \sigma_{\textrm{Noise}}^2\right] + \frac{-1}{2 \sigma^2_{\textrm{Noise}}} \sum_{n=1}^{N} \left( y^n_{Obs} - y_{truth}\left(x^n,w\right) \right)^2 }

The first term depends only on the inherent noise of the system/problem and the number of observations. The right-hand term of the equation is where our model can make a difference. We want the model that makes this term the smallest.

\Large{ \textrm{Solution} = min_{w} \sum_{n=1}^{N} \left( y^n_{Obs} - y_{truth}\left(x^n,w\right) \right)^2 }

Each of the dashed blue lines in the first figure is a solution to this problem.  Let's now look at the expected cost.

Cost & Expected Cost

The solution of Linear Regression is the model with the least sum squared difference between the model prediction and the observed values. This is commonly called the cost or loss function. Different machine learning methodologies use different cost functions.

We showed above that the cost function for basic linear regression is:

\Large{ Cost = \sum_{n=1}^{N} \left( y^n_{Obs} - y\left(x^n,w\right) \right)^2 }

where the truth subscript has been dropped because we do not know the truth. We are using a model y\left(x,w\right) to predict what we observe.

What is observed is expected to be a continuous value with a Gaussian distribution around the true value. We also expect to see particular combinations of features in a data set. I want to define the probability density of getting a particular target and set of features as:

\Large{ p\left(x,y_{obs} \right) = \textrm{Probability of observing features and value} }

We can discuss the expected cost of a model then as the integral over all possible data and all possible observed values:

\Large{ E\left[ Cost \right] = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }

Obviously we want to choose the model y\left(x,w\right) that will give the lowest expected cost. We can differentiate each side of the expected cost and set it to zero to find the model that does this. We will call the solution the optimal model y_{Optimal}

\Large{ \frac{\delta E\left[ Cost \right]}{\delta y\left(x,w\right)} = 2 \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}\left(x,w\right) \right) \cdot p\left(x,y_{Obs} \right) \cdot \delta y\left(x,w\right) dy_{Obs} dx = 0 }

Doing a bit of algebra allows us to rewrite the above equation as below.

\Large{ \frac{\delta E\left[ Cost \right]}{\delta y\left(x,w\right)} = 2 \int_{x} \left[ \int_{y_{obs}} y_{Obs} \cdot p\left(x,y_{Obs}\right) dy_{Obs} - y_{Optimal}\left(x,w\right) \int_{y_{obs}} p\left(x,y_{Obs}\right) dy_{Obs} \right] \delta y\left(x,w\right) dx = 0 }

This highlights that:

\Large{ \int_{y_{obs}} y_{Obs} \cdot p\left(x,y_{Obs}\right) dy_{Obs} - y_{Optimal}\left(x,w\right) \int_{y_{obs}} p\left(x,y_{Obs}\right) dy_{Obs} = 0 }

Then the optimal solution is:

\Large{ y_{Optimal}(x) = \frac{1}{p\left(x\right)} \cdot \int_{y_{obs}} y_{Obs} \cdot p\left(x,y_{Obs}\right) dy_{Obs} }

where

\Large{ p\left(x\right) = \int_{y_{obs}} p\left(x,y_{Obs}\right) dy_{Obs} }

We can make the following simplification by using the following probability rule:

\Large{ p(x,y_{Obs}) = p(y_{Obs}|x) \cdot p(x) }

Optimal Model

\Large{ y_{Optimal}(x) = \int_{y_{obs}} y_{Obs} \cdot p\left(y_{Obs} | x\right) dy_{Obs} }

Types of Cost

Now that we have a way to find the optimal model we can rewrite the expected cost.

\large{ E\left[ Cost \right] = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}(x) + y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }

\Large{ E\left[ Cost \right] = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}\right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }

\Large{ + \int_{x} \int_{y_{obs}} \left(y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }

\Large{ + \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}(x)\right) \cdot \left(y_{Optimal} - y\left(x,w\right) \right) p\left(x,y_{Obs} \right) dy_{Obs} dx }

It is relatively easy to show the last/bottom term is zero. That leads to two types of expected cost. The first term is the irreducible noise.

\large{ \textrm{Expected Irreducible Noise} = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}(x)\right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }

No matter how good your model is, even if it is optimal, there will be observations that differ from the ground truth because of the noise. This is the buying and selling process in the house example that moves the selling price away from the rational price.

The second term is the model error:

\Large{ \textrm{Expected Model Error} = \int_{x} \int_{y_{obs}} \left(y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }

\Large{ = \int_{x} \left(y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x\right) dx }

The result only depends on the distribution of all possible features.

Cost From Data

We are not given a model, but instead derive a model from data. The model will always be different if it is derived from different data. We will notate this y(x,w|D) where D is a given data set.

We can then write the expected model error as:

\Large{ \textrm{Expected Model Cost} = \int_{x} \int_{D} \left(y_{Optimal}(x) - y\left(x,w|D\right) \right)^2 p\left(x,D\right) dx \ dD }

Where the model now depends on the dataset observed, and we integrate over all possible datasets. We can define the expected model based on all possible datasets as:

\Large{ y_{Data}(x,w) = \int_{D} y\left(x,w|D\right) p\left(x,D\right) dD }

As before we can add and subtract the term in expected error.

\Large{ \textrm{Expected Model Cost} = \int_{x} \int_{D} \left(y_{Optimal}(x) -y_{Data}(x) + y_{Data}(x) - y\left(x,w|D\right) \right)^2 p\left(x,D\right) dx \ dD }

\Large{ \textrm{Expected Model Cost} = \int_{x} \int_{D} \left(y_{Optimal}(x) - y_{Data}(x)\right)^2 p\left(x,D\right) dx \ dD }

\Large{ + \int_{x} \int_{D} \left(y_{data}(x) - y(x,w|D)\right)^2 p\left(x,D\right) dx \ dD }

\Large{ + \int_{x} \int_{D} \left(y_{Optimal}(x) - y_{Data}(x)\right) \cdot \left(y_{data}(x) - y(x,w|D)\right) p\left(x,D\right) dx \ dD }

Also as before it is fairly easy to show the last term is zero.

\Large{ \textrm{Expected Model Cost} = \int_{x} \left(y_{Optimal}(x) - y_{Data}(x)\right)^2 p\left(x\right) dx \\ + \int_{x} \int_{D} \left(y_{data}(x) - y(x,w|D)\right)^2 p\left(x,D\right) dx \ dD }

The expected model cost then results from two terms. The first/top term is the expected squared difference between the optimal model and the model we expect the data to produce. This is the squared bias associated with the expected model.

\Large{ \textrm{Expected Squared Model Bias} = \int_{x} \left(y_{Optimal}(x) - y_{Data}(x)\right)^2 p\left(x\right) dx }

The second/bottom term in the model error is the expected squared difference between each possible model and the expected model generated from the data. This is the variance of the model around the expected model.

\Large{ \textrm{Expected Model Variance} = \int_{x} \int_{D} \left(y_{data}(x) - y(x,w|D)\right)^2 p\left(x,D\right) dx \ dD }

Bias-Variance Tradeoff

The expected cost is then given by:

\Large{ E\left[Cost\right] = \textrm{Irreducible Noise} + \textrm{Expected Squared Model Bias} + \textrm{Expected Model Variance} }

The models that we fit on data sit somewhere on the spectrum between high bias and high variance.   If we refit the model on new data, we will either get a lot of variation from model to model, or a biased result that varies less.   In practice this is not directly useful because we typically have one data set.
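If you want to see the decomposition numerically rather than just in plots, here is a rough sketch (my own, reusing the toy generating process from the code above) that estimates the squared bias and variance terms by simulating many training sets.  Averaging over the x grid stands in for the integral over p(x), so treat the numbers as illustrative.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
noise_sd = 10.0
x_grid = np.linspace(0, 10, 101).reshape(-1, 1)
y_true = 2.5 * x_grid + 10

preds = []
for _ in range(500):                        # many independent training sets
    x = rng.randn(10, 1) * 2 + 5
    y = 2.5 * x + 10 + rng.randn(10, 1) * noise_sd
    preds.append(Ridge(alpha=10).fit(x, y).predict(x_grid))
preds = np.array(preds)

y_bar = preds.mean(axis=0)                  # the model expected over datasets
bias_sq = np.mean((y_bar - y_true) ** 2)    # expected squared model bias
variance = np.mean((preds - y_bar) ** 2)    # expected model variance
print("squared bias : %.2f" % bias_sq)
print("variance     : %.2f" % variance)
print("irreducible  : %.2f" % (noise_sd ** 2))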

It is also worth noting that the tradeoff becomes less of a tradeoff with more data.   Here are the same plots with 10x the data as before.

Screen Shot 2016-01-31 at 2.08.13 PM

Final Thoughts

I think this is interesting, and useful for a deeper theoretical understanding of model fitting, but it may not be practical.  This understanding is not directly applicable when building a model from a single dataset.  It might be useful in a meta-analysis doing a post-mortem on a model that is in use.  It is definitely useful in trying to understand Bayesian regression techniques, which will likely be another post in the near future.

Thank you

As always, thank you for reading.   Hope it was useful and/or interesting.  Feel free to share.

Profit Curves

I have been talking about performance metrics recently, and I wanted to post a review of something from chapter 8 of Data Science For Business.

Profit curves are a way of giving a benefit value to correct predictions and a cost value to incorrect predictions.  I really like them because they give a way to evaluate the business impact of a model.   We can often gain epsilon increases in performance that do not make a meaningful difference to the bottom line.   Profit curves allow us to see this.

Profit curves also allow us to estimate the optimal threshold, or what fraction of the strongest predictions to call positive, to maximize the business impact.

Profit curves are also constructed in a way similar to ROC curves.  ROC plots show the true positive rate and false positive rate in order of prediction strength.  Profit curves show expected profit in order of prediction strength.

Confusion Matrix

The confusion matrix shows the counts of target classifications vs prediction classifications in a matrix.   The true positive rate and false positive rate are target-based metrics.  This means they are normalized by the count of either total target positives or total target negatives.

\Large{\left( \begin{array}{cc} TP & FP \\ FN & TN \end{array} \right) \rightarrow \left( \begin{array}{cc} \frac{TP}{TP+FN} & \frac{FP}{FP+TN} \\ \frac{FN}{TP+FN} & \frac{TN}{FP+TN} \end{array} \right)}

An ROC curve would plot the top row of the right-hand matrix in order of prediction strength.

Normalizing the above confusion matrix by the target counts lets us estimate the rate at which a given prediction is correct or incorrect.  The next step in constructing a profit curve is to weight the matrix by the population proportions of positive and negative targets.  If we know the population proportions P_+ and P_-, this gives a feel for the actual rate of misclassification and correct classification in the population we care about.

\Large{\left( \begin{array}{cc} \frac{TP}{TP+FN} \ P_+ & \frac{FP}{FP+TN} \ P_- \\ \frac{FN}{TP+FN} \ P_+ & \frac{TN}{FP+TN} \ P_- \end{array} \right)}

Profit Matrix

Profit curves are created by giving each term in the above altered confusion matrix a numerical value: a cost or a benefit.

\Large{\mbox{Profit Matrix} = \left( \begin{array}{cc} B_{P_+} & C_{P_+} \\ C_{P_-} & B_{P_-} \end{array} \right)}

For illustrations of the calculations, and to develop some intuition, I will use the following example:

\Large{\mbox{Profit Matrix} = \left( \begin{array}{cc} 10 & -2 \\ -2 & 6 \end{array} \right)}

This matrix says a correct positive prediction is worth $10, a correct negative prediction is worth $6, and a misclassification costs $2.

Profit Calculations

I want to do two toy calculations using the example matrix.  The first case is where the model predicts everything positive.  N_+ and N_- are the counts of positive and negative targets in the population.

\Large{ \left( \begin{array}{cc} N_+ & N_- \\ 0 & 0 \end{array} \right) \rightarrow \left( \begin{array}{cc} 1 & 1 \\ 0 & 0 \end{array} \right) }

The (1, 1) entries illustrate that this is the upper-right corner of an ROC curve.

If we have a population with positive and negative proportions P_+ = 0.33 and P_- = 0.67, the rate matrix looks like:

\Large{\left( \begin{array}{cc} .33 & .67 \\ 0 & 0 \end{array} \right) }

The element-wise multiplication with our cost matrix gives:

\Large{ \left( \begin{array}{cc} 3.33 & -1.33 \\ 0 & 0 \end{array} \right) }

We sum all the elements of this matrix together to get the expected profit per prediction:

\Large{ E[\mbox{Profit}] = 3.33 - 1.33 + 0 + 0 = 2 }

This strategy is profitable. If we look at the other extreme and predict everything is negative we get the following results:

\Large{ \left( \begin{array}{cc} 0 & 0 \\ N_+ & N_- \end{array} \right) \rightarrow \left( \begin{array}{cc} 0 & 0 \\ 1 & 1 \end{array} \right) }

\Large{\left( \begin{array}{cc} 0 & 0 \\ .33 & .67 \end{array} \right) }

The element-wise multiplication with our cost matrix produces:

\Large{\left( \begin{array}{cc} 0 & 0 \\ -.67 & 4.00 \end{array} \right) }

\Large{E[\mbox{Profit}] = 0 + 0 + -.67 + 4.00 = 3.33}

I now have two extreme strategies.  In one case I predict everything to be positive and make $2 per prediction.  In the other case I predict everything to be negative and make $3.33 per prediction.
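Here is a quick numpy check (my own sketch) of the two extreme-strategy calculations above, using the profit matrix and P_+ = 1/3, P_- = 2/3:

import numpy as np

profit = np.array([[10., -2.],     # [[B+, C+], [C-, B-]], the orientation used above
                   [-2.,  6.]])
p_pos, p_neg = 1 / 3.0, 2 / 3.0

all_positive = np.array([[p_pos, p_neg], [0.0, 0.0]])   # predict everything positive
all_negative = np.array([[0.0, 0.0], [p_pos, p_neg]])   # predict everything negative

print("predict all positive: %.2f per prediction" % (profit * all_positive).sum())
print("predict all negative: %.2f per prediction" % (profit * all_negative).sum())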

Models can help improve our predictions, and we would like to do two things.

  1. Forecast increase in profits
  2. Choose the model parameters that optimize profits

Flower Shop

For this post I am going to make a profit curve for a flower shop that sells at two locations, where I am trying to automate the shipping. If we get our best flowers to the right location, we make 10 dollars, but we have to pay 2 dollars in shipping to the correct place if the prediction is incorrect. If we get the other flowers to the other location, we make 6 dollars, but again we pay 2 dollars in shipping if that prediction is incorrect. The cost-benefit matrix is the one I used above.

\Large{ \mbox{Profit Matrix} = \left( \begin{array}{cc} 10 & -2 \\ -2 & 6 \end{array} \right) }

We already know the two extremes in profit, but let's fit some models on the iris data set and make some profit curves.

Model

I am using the Iris data set fit with two models:  Logistic Regression and Support Vector Machines.   I have hamstrung both models by not performing a train/test split (in the interest of focusing on making profit curves) and by artificially giving both models the same Area Under the ROC Curve (AUC).

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,roc_curve, roc_auc_score
%matplotlib inline
iris = datasets.load_iris()
X = iris.data                          # use all four features
Y = (iris.target==1).astype(int)       # one-vs-rest target for class 1
svm = SVC(C=.1,probability=True,kernel='linear')
log = LogisticRegression(C=1)
log.fit(X,Y)
svm.fit(X,Y)
y_pred = log.predict(X)
probs = []
p = log.predict_proba(X)[:,1]
probs.append(p)
fpr, tpr, thresholds = roc_curve(Y,p)
plt.figure(figsize=(16,8))
plt.plot(fpr,tpr,label='Log ROC Curve - AUC: %.2f' % roc_auc_score(Y,p),color='steelblue',lw=2)
y_pred = svm.predict(X)
p = svm.predict_proba(X)[:,1]
probs.append(p)
fpr, tpr, thresholds = roc_curve(Y,p)
plt.plot(fpr,tpr,label='SVM ROC Curve - AUC: %.2f' % roc_auc_score(Y,p),color='firebrick',lw=2)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

Screen Shot 2016-01-26 at 6.59.43 AM

The SVM has a higher true positive rate for lower false positive rates, while the logistic regression model has a higher true positive rate for larger false positive rates.   Everything predicted positive is the upper-right corner, and everything predicted negative is the lower-left corner.

The AUC is the same for both models (and even if it were not), we need a way to evaluate the business impact.

Profit Curve Calculations

Now that I have models and have scored all the probabilities, we can look at the confusion matrix.  Because I am using sklearn’s method, I want to point out that its confusion matrix has the following format:

\Large{ \left( \begin{array}{cc} TN & FP \\ FN & TP \end{array} \right) }

The rates then become:

\Large{ \left( \begin{array}{cc} TN & FP \\ FN & TP \end{array} \right) \rightarrow \left( \begin{array}{cc} \frac{TN}{TN+FP} \ P_- & \frac{FP}{TN+FP} \ P_- \\ \frac{FN}{TP+FN} \ P_+ & \frac{TP}{TP+FN} \ P_+ \end{array} \right) }

The profit matrix is multiplied element-wise with the above matrix.  The result is:

\Large{ \left( \begin{array}{cc} \frac{TN}{TN+FP} \ P_- \ B_- & \frac{FP}{TN+FP} \ P_- \ C_- \ \\ \frac{FN}{TP+FN} \ P_+ \ C_+ & \frac{TP}{TP+FN} \ P_+ \ B_+ \end{array} \right) }

We now sum each element in the matrix and get the expected profit.
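Wrapped up as a helper (my own, equivalent to the element-wise calculation just described), with the profit matrix arranged in sklearn's [[TN, FP], [FN, TP]] layout:

import numpy as np

def expected_profit(cm, profit):
    # cm is an sklearn-style confusion matrix [[TN, FP], [FN, TP]];
    # profit is a cost/benefit matrix in the same layout.
    cm = cm.astype(float)
    rates = cm / cm.sum(axis=1, keepdims=True)          # target-normalized rates
    props = (cm.sum(axis=1) / cm.sum()).reshape(2, 1)   # class proportions P-, P+
    return (rates * props * profit).sum()

# e.g. expected_profit(confusion_matrix(Y, (p >= 0.5).astype(int)),
#                      np.array([[6., -2.], [-2., 10.]]))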

Making Profit Curves

To make the profit curve we now just go through the predictions ordered from strongest to weakest.   For each prediction, we say everything as strong or stronger than that prediction is positive, and everything weaker is negative.  We then calculate the expected profit as shown above.

names = ['Log','SVM']
colors = ['steelblue','firebrick']
# Profit matrix arranged to match sklearn's [[TN, FP], [FN, TP]] confusion matrix layout
cost = np.array([[6., -2.],
                 [-2., 10.]])
plt.figure(figsize=(16,8))
for i,p in enumerate(probs):
    order = np.argsort(p)
    cost_vals = []
    for pp in p[order]:
        cm = confusion_matrix(Y,(p>=pp))
        #Make Rates
        cmr = cm/cm.sum(axis=1).astype(float).reshape(2,1)
        #Multiply by target proportions
        acmr = cmr*(cm.sum(axis=1).astype(float)/cm.sum()).reshape(2,1)
        #Elementwise multiplication with cost matrix
        cost_vals.append((acmr*cost).sum())

    plt.plot(np.array(range(len(p),0,-1)).astype(float)/len(p),cost_vals,label='%s Curve - AUC: %.2f - Max Profit %.2f' % (names[i], roc_auc_score(Y,p),max(cost_vals)),color=colors[i],lw=2)
plt.xlabel("Percent Predicted Positive")
plt.ylabel("Expected Profit Per Prediction")
plt.legend()
plt.show()

Screen Shot 2016-01-26 at 7.22.13 AM.png

The profit curve shows that logistic regression has a slightly higher maximum profit than the SVM model, though in reality we should be using train/test splits and bootstrapped errors.   As a first-order estimate this gives a feel for the business value a model can have.  In this case, it allows us to increase profits significantly.

Final Thoughts

There are other business considerations that have to be weighed in model choice, such as technical debt, model debt, and data debt.   Models do have to be maintained in practice.

While at Galvanize we did a churn analysis similar to this, with a costly intervention strategy for a cell phone provider.   The best performing models are also the most costly to implement, train, and maintain.  Using the wrong thresholds can also cost money.

Screen Shot 2016-01-26 at 7.31.30 AM.png

There are significant gains to be had in this case, but the human and infrastructure costs have to be considered as well.

Thank You

As always, thanks for reading.  Hope you enjoyed it or gained something.

Implementing Logistic Regression From Scratch – Part 4 : Customizing Cost Function

This is the final post in implementing logistic regression from scratch. The initial motivation for this series was that I wanted to post on the theory and basic implementation of logistic regression, but also show that there are cases where one might not want to use the default models provided by a package. In this post I am going to give a toy case for this scenario using a wine data set.

My thought process is that there are contexts within companies where multiple analyses lead to baseline expectations using separate datasets. In my company we have places where there are multiple distinct sources of information that can be used to build the same metrics or models. I imagine this is the case for most data-driven companies. Context always improves the usefulness of information and decision making, and finding ways to incorporate context into models improves the generalization of the results.

In this post I am considering a scenario where the results from another analysis set a baseline that I want incorporated in the model, but the training data cannot be joined with the data used for the previous analysis. In this context, the previous analysis is setting a prior. I want the prior to change the logistic cost function so that the results include information about the baseline.

Results of a Previous Analysis

Imagine the contrived example that someone looked at a worldwide sample of wines and found that the amount of free sulfur dioxide (a preservative) in the wine was related to how likely a wine was to be red rather than white. The more free sulfur dioxide, the more likely it was to be white. It turned out that the log odds of being red was approximately linear with respect to the concentration of free sulfur dioxide.

ln \big( \frac{p_{\textrm{red}}}{p_{\textrm{white}}} \big) \approx w_1 \big(\textrm{Free Sulfur Dioxide}\big) + w_0

Using the worldwide sample, the coefficient estimate was approximately Gaussian with the following results:

w_1 = -0.14
SE_{w_1} = 0.005

These results can be used to construct a prior on this coefficient (but not the other weights).

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-0.10,-0.20,1000)
y = norm.pdf(x,loc=-0.14,scale=0.005)
plt.figure(figsize=(16,8))
plt.plot(x,y,color="#AA3333")
plt.ylim(y.min(),y.max()+1)
plt.xlabel('Coefficient Value')
plt.ylabel('Probability Density')
plt.show()

Screen Shot 2016-01-17 at 6.10.55 PM

Ridge Regularization

In previous posts I covered the theory of regularization and implemented LASSO (Laplace prior) and Ridge (Gaussian prior) regularization. In both cases the assumption is that all coefficients have the same prior distribution centered around zero. In this case we have a prior on only one coefficient, and it is not centered at zero. We can generalize the prior(s) to include this.

Updating Equations

The previous prior we used assumed each weight was centered at zero, and that all weights had the same distribution:

\mathcal{P} \propto e^{-\frac{1}{2 \sigma^2}\sum_i {w_i}^2}

If we allow each weight to have a different expected value and a different standard deviation, this changes to:

\mathcal{P} \propto e^{-\frac{1}{2} \sum_i {\big(\frac{w_i-\bar{w_i}}{\sigma_i}\big)}^2}

That means the negative log prior is:

- ln \mathcal{P} = \frac{1}{2} \sum_i {\big(\frac{w_i-\bar{w_i}}{\sigma_i}\big)}^2 + C

To get an update rule to include this prior(s) we take the derivative:

\frac{\partial}{\partial w_i} \big(- ln \mathcal{P}\big) = \frac{1}{\sigma_i^2} \big(w_i-\bar{w_i}\big)

Including the logistic weight update rule, we get the following equation:

w_i = w_i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X_i^k - \alpha \frac{1}{\sigma_i^2} \big(w_i-\bar{w_i}\big)

The convention is the same as in similar posts: w_i is the coefficient for the ith feature, X_i^k is the ith feature of the kth example, w_0 is the intercept, \alpha is the learning rate, \bar{w_i} is the expected value from the prior, and \sigma_i is the standard deviation of the estimate of \bar{w_i}.

Our Data

I have a sample of wines from South America, and I want to build a model that estimates the probability that any wine is red based on the concentration of free sulfur dioxide. I have a biased sample, so in an effort to generalize I want to include the prior. There are a number of issues that I am going to ignore.

  1. In estimating the out-of-sample performance, the model without the prior will likely look better. Both sides of a train/test split will be biased South American samples, not samples that reflect the distribution of wines in the world.

  2. I have no way to test on a generalized population. I would have to release the model into the wild and monitor its performance.

  3. With enough data, the model will fit to the sample and ignore the prior.

Here’s the data:

wine = pd.read_csv("wine.csv")[['free sulfur dioxide','Type']]
wine.head()

   free sulfur dioxide   Type
0                   67  White
1                   32  White
2                   31  White
3                   15    Red
4                   25    Red

In the previous posts in this series I wrote classes for logistic regression, but in this post I am going to do it simply with three functions.   The first function is the probability estimate using the weights, the second is the value of the cost function (the negative log posterior), and the third is the gradient of the cost function.

def prob(w,X):
    # Predicted probability of the positive class for each row of X
    return 1/(1+np.exp(-X.dot(w)))

def cost(w,X,y,l2,ec):
    # Negative log likelihood plus the generalized Gaussian prior term
    return -np.sum(y*np.log(prob(w,X))+(1-y)*np.log(1-prob(w,X)))+l2.dot(np.power(w-ec,2))/2

def grad(w,X,y,l2,ec):
    # Gradient of the log posterior with respect to the weights
    return X.T.dot(y-prob(w,X))-l2.reshape(l2.size,1)*(w-ec)

The weight column vector w has the intercept in the zeroth position, and the other entries are the slopes for the features.  The X is a matrix whose first column is all 1s for the intercept.  The y is a column vector of labels.   The l2 is a row vector of regularization parameters, which lets us set a different parameter for each feature.  The ec is a column vector of expected values from the prior.

For our case I am going to use the prior for l2 and ec:

ec = np.array([[0],[-0.14]])
l2 = np.array([0,1/0.005**2])

In a previous post on regularization, I examined the cost function in the parameter space.  It was easy to see the parameter values the solution would settle on, and I showed this with sklearn in that post.  Here, let's look at what would happen to the solution if we did traditional ridge regression with a large regularization parameter of 4000 (on the same scale as the prior precision \frac{1}{0.005^2}).

I want to load a random sample of 500 wines:

indexes = np.random.choice(wine.shape[0],500)
data = wine.loc[indexes,['free sulfur dioxide']].values
labels = (wine.loc[indexes,'Type'].values=='Red').astype(int)
X = np.hstack((np.ones((data.size,1)),data))
y = labels.reshape((labels.size,1))

Here is the code; the resulting contour plot of the cost function is below:

t1 = np.linspace(0,5,100)
t2 = np.linspace(-.3,.0,200)
ecz = np.zeros((2,1))
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16,8))
plt.subplot(121)
l2_temp = np.zeros(2)
ax = plt.gca()
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2_temp,ecz)
levs = np.linspace(np.round(Z.min()+50),np.round(Z.max()-50),6)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Blues_r)
#CS = plt.contour(T1, T2, Z,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.subplot(122)
l2_temp = np.array([4000.,4000.])
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2_temp,ecz)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Reds_r)
#CS = plt.contour(T1, T2, Z,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior')
plt.show()

Here is the resulting plot:

Screen Shot 2016-01-17 at 6.39.44 PM

The left plot is a contour plot of the negative log likelihood, and the right plot is a contour plot of the negative posterior after applying a Gaussian prior on the weights centered at zero.   The traditional regularization pulls the coefficients from about (2, -0.12) to (0, -0.05).  Running this data through sklearn should confirm this:

model = LogisticRegression(C=1e25)
model.fit(data,labels)
print "Unregularized Sklearn: ", model.intercept_[0],model.coef_[0,0]
model = LogisticRegression(C=1/4000.)
model.fit(data,labels)
print "Regularized Sklearn: ", model.intercept_[0],model.coef_[0,0]
Unregularized Sklearn:  1.72872081295 -0.123757551858
Regularized Sklearn:  0.00515437416919 -0.0546835703845

Instead of applying the traditional prior, I want to apply the experimental prior to our likelihood and find the minimum of the resulting negative log posterior.   First, I want to look at the cost function and the posterior function.

t1 = np.linspace(0,5,100)
t2 = np.linspace(-.3,.0,200)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16,8))
plt.subplot(121)
l2_temp = np.array([0,0])
ax = plt.gca()
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2_temp,ec)
levs = np.linspace(np.round(Z.min()+50),np.round(Z.max()-50),6)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.subplot(122)
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2,ec)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Reds_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior')
plt.show()

Screen Shot 2016-01-17 at 6.49.06 PM.png

The left plot is the same as above, but the right plot is now the posterior after applying regularization to only the free sulfur dioxide coefficient.   Sklearn and other packages do not have a way to solve this situation, which is why I wrote my own functions above.

Solving the Custom Cost Function

I have shown gradient descent numerous times in this series.  This will be the final time!

w = np.zeros((2,1))
i = 0
while np.linalg.norm(grad(w,X,y,l2,ec)) > 1e-6:
    if i % 10000 == 0:
        print cost(w,X,y,l2,ec)[0],np.linalg.norm(grad(w,X,y,l2,ec))
    w = w + 1e-5*grad(w,X,y,l2,ec)
    i += 1
print "Custom Results: ", w[0,0],w[1,0]
738.57359028 11753.013914
183.894939846 1.61330803289
183.861493402 0.033069210393
183.861479334 0.000678928106008
183.861479328 1.39392185238e-05
Custom Results:  1.99719535168 -0.138002425824

These weights represent coefficients that cannot be found using standard logistic regression on the data provided, with or without regularization.  Instead, they reflect the optimal weights given the combination of the previous analysis and the data provided.
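As a quick way to see how different these fits really are, here is a small sketch (my own, plugging in the rounded weights printed above) comparing the implied probability curves at a few free sulfur dioxide levels:

import numpy as np

def p_red(intercept, slope, so2):
    # Logistic curve implied by a given intercept and slope
    return 1.0 / (1.0 + np.exp(-(intercept + slope * so2)))

so2 = np.array([5.0, 15.0, 30.0, 60.0])
print(p_red(1.73, -0.124, so2))    # unregularized sklearn fit
print(p_red(0.005, -0.055, so2))   # zero-centered ridge fit (C = 1/4000)
print(p_red(2.00, -0.138, so2))    # fit with the experimental prior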

Review

In this post I showed how we can make a simple alteration to the cost function to allow priors centered on non-zero weights, with different regularization coefficients for each feature.  Using that altered function, I used the results from a previous analysis as the prior for the negative log likelihood constructed from the data.  The result is a set of coefficients that cannot be generated by packages like sklearn, but that could generalize better.

Thank You

I appreciate you taking the time to read this post. I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.