Description: A curated list of mostly free data sets that are available. Most of the links are to websites that you can eventually find that data at. Expected to do a lot of digging to find the data, and be happy if you do not need to!

Description: This is a collection of links for CSV datasets collected from R packages hosted on CRAN. It is not a complete list of all data sets, but it is a good set.

Warning: This tutorial is originally writing in Nov 2016. The rate of change of both Spark and IntelliJ may make these instructions obsolete sooner rather than later. I provide links to sources wherever possible. Post a comment if you do find something obsolete and I will attempt to fix it.

Install Scala Build Tool

The instructions for installing SBT 0.13.5 are found here. Unfortunately the link uses the explicit version, and not ‘latest’, so you may want to click around until you find the latest instructions.

Strictly speaking, you do not need an IDE for Scala, but I highly recommend one. If you are new to Scala, and have build any Spark applications in python or R, you will likely get caught by Scala being strongly typed. An IDE like IntelliJ will make it explicit for you, and catch mistakes as you make them.

Scala IDE Options

There are two options that are worth considering ala Quora and Reddit

Using IntelliJ, you can import the exisiting project after you have cloned it. Open IntelliJ and create a project from an existing source.

Select the cloned location and import it.

Choose import SBT for the existing project.

Click Next, choose your preferences, and click finished. The project will take a little time to import and download supporting files. After all is done you should see an open project with the following structure.

Review build.sbt Structure

If you are new to Scala than I think you might be curious about this build.sbt file. It is used to sets up our project and describes the libraries we want to include in our project. Here is the file with some additional comments

//Defines the version of our application and the scala version
//Change scala version for your project
// Spark 2+ Scala 2.11, Spark 1.6 Scala 2.10
val globalSettings = Seq(
version := "1.0",
scalaVersion := "2.11.8"
)
//Defines locations to look for dependencies
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Repo at github.com/ankurdave/maven-repo" at "https://github.com/ankurdave/maven-repo/raw/master"
//Defines project structure.
lazy val sparkScalaTemplate = (project in file("."))
.settings(name := "SparkAppName-ToBeReplaced")
.settings(globalSettings:_*)
.settings(libraryDependencies ++= dependencies)
//Defines the version of spark we are going to use
val sparkVersion = "2.0.1"
//Lists the dependencies
lazy val dependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
// %% means auto include the Scala verion
// "org.apache.spark" % "spark-mllib_2.11" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion,
"org.apache.spark" %% "spark-graphx" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion
)
//Defines how to include jars in the UberJar
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) =&amp;amp;gt; MergeStrategy.discard
case x =&amp;amp;gt; MergeStrategy.first
}

The first thing you are likely to do is want add additional libraries to integrate with your project or to integrate with your production system. Maven is going to be the best and first source for these libraries. You can see the different spark projects here: https://mvnrepository.com/artifact/org.apache.spark

To get the documentation on how to use this library I would do a google search of ‘org.postgresql javadocs’. The first result is https://jdbc.postgresql.org/documentation/publicapi/. I will let you engage in the thrill of discovery in how to implement this.

Spark PI

Now that we have a project, lets quickly get to implementing and running the PI example. I have included it in the github template with a minor addition.

package org.apache.spark.examples
import scala.math.random
import org.apache.spark.sql.SparkSession
/** Computes an approximation to pi */
object SparkPi {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.master("local[*]") //Added from originals for local run in IntelliJ
.appName("Spark Pi")
.getOrCreate()
val slices = if (args.length &gt; 0) args(0).toInt else 2
val n = math.min(1000000L * slices, Int.MaxValue).toInt // avoid overflow
val count = spark.sparkContext.parallelize(1 until n, slices).map { i =&gt;
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y &lt;= 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / (n - 1))
spark.stop()
}
}

Run

With SparkPi.scala open in the IntelliJ IDE you can ‘Run’ the app.

The output is below:

Pi is roughly 3.1416575708287855

Build and Run any Spark Example

So now I’m going to work through pulling all of the Spark Examples so you can build and play with yourself.

First we are going to clone a portion of the repo, and restructure the files to fit into the IntelliJ project format.

Now we want to create a new Scala-SBT project in this directory.

Now you can replace the build.sbt with the below code.

val globalSettings = Seq(
version := "1.0",
scalaVersion := "2.11.8"
)
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Repo at github.com/ankurdave/maven-repo" at "https://github.com/ankurdave/maven-repo/raw/master"
lazy val sparkScalaTemplate = (project in file("."))
.settings(name := "SparkExamples")
.settings(globalSettings:_*)
.settings(libraryDependencies ++= dependencies)
val sparkVersion = "2.0.1"
lazy val dependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion,
"org.apache.spark" %% "spark-graphx" % sparkVersion,
"com.github.scopt" %% "scopt" % "3.5.0"
)

When you save this IntelliJ will ask if you want to auto import and/or refresh. I set it to do this automatically. There may be some red warnings. Ignore them for a moment.

Go ahead and open SparkKMeans.scala and run it.

I get a few errors having to do with the examples using previous versions of the Spark API. There are only three files.

LogisticRegressionWithElasticNetExample.scala
comment out .setFamily("multinomial") on line 54
change coefficientMatrix to coefficients on line 59
change interceptMatrix to intercept on line 60
MulticlassLogisticRegressionWithElasticNetExample.scala
change coefficientMatrix to coefficients on line 50
change interceptMatrix to intercept on line 51
SparkSQLExample.scala
change df.createGlobalTempView("people") to df.createTempView("people")

Now when you try to run SparkKMeans.scala you will get the following error because we have not supplied arguments.

Usage: SparkKMeans <file> <k> <convergeDist>
Process finished with exit code 1

In the top right of the IDE we can edit and change the configurations.

And add the following arguments

data/mllib/kmeans_data.txt 2 1.0

The last thing we need to do is add a master(“local[*]”) to the spark session builder.

val spark = SparkSession
.builder
.master("local[*]")
.appName("SparkKMeans")
.getOrCreate()

Now we can run the example and get the following output.

Final centers:
DenseVector(0.2, 0.2, 0.2)
DenseVector(0.1, 0.1, 0.1)

Conclusion

At this point you should be able to run all the Spark Examples with some minor edits. All the examples will need to add master(“local[*]”) in the building the spark session. Some of the examples will require the you provide arguments.

This a a summary of the presentation I gave at the October OC R User Group Meet Up. The focus on the talk was to solve the following problem statement using R.

I am investing in Lending Club notes. Specifically Grade A, 36 Month loans. I want to generate an expected cash flow for my investment.

Note

Sometimes wordpress.com will format code like the pipes and R assignment incorrectly. You can find the code for this post here: https://github.com/bryantravissmith/OC-RUG

Setup

I want to generate my expectations for grade A, 36 months loan. The main goal is to generate a framework for the solution that it can also be applied to other loans. I am starting off creating a data frame from all the join Lending Club Data from when I selected the target loans from tranches of loan that are mature.

library(dplyr)
library(reshape2)
library(ggplot2)
library(zoo)
library(stringr)
library(knitr)
library(DT)
# Previously loaded all LC data into a single data frame and saved as RData
load("/opt/science/Datasets/LendingClub/2016-06-30/lcCombinded.RData")
lcA <- lc %>% mutate(
#Crate a Data Object for the Issue Date
issue_yearmon = as.Date(as.yearmon(issue_d,format="%b-%Y")),
#Create a Numeric Issue Year For Group By Calculations
issue_year = as.numeric(as.character(issue_yearmon,format="%Y")),
#Convert to Date object to calculate the 'Age' of the loan
last_pymnt_yearmon = as.Date(as.yearmon(last_pymnt_d,format="%b-%Y")),
#Converte interest rate to ta numeric value
interest = as.numeric(str_replace(int_rate,"%",""))
) %>%
mutate(
#Make what are roughly 1 month wide in the time between the loan was
#originated and the time it stopped generating cash flows
AgeBucket = cut(as.numeric((last_pymnt_yearmon-issue_yearmon)/365),
breaks=seq(0,6,1/12),
include.lowest=T,
right=F)
) %>%
mutate(
#Converte the Age Bucket to a Numeric Month or Statement value
Age = match(AgeBucket,levels(AgeBucket))
) %>%
mutate(
#Get the total recieved payments for each loan
total_rec = total_rec_prncp+total_rec_int
) %>%
#Limit to the data we are interested in for the problem
#Remove 2007 - stands out in graphs, actually doesn't change the results too much.
filter(term == " 36 months",grade == "A", issue_year < 2013, issue_year > 2007)

Cash Flow Expectation : Paid As Agreed

The simplest expectation is to just assume that everyone will make every schedule payment. The equation for the payment of a fix term loan is:

You can then use this to get the amortization schedule for the loan.

# Generates the expected payment for a fixed term loan.
payment_value <- function(loan_amount,interest_rate,term_number) {
loan_amount*interest_rate/12/(1 - (1+interest_rate/12)^(-1*term_number))
}
#Gereates the scheduled payment behavior for a loan that is paid as agreed
get_amortization <- function(term_number,interest_rate){
amortization <- data.frame(statement = seq(0,term_number,1),
payment = c(0,rep(payment_value(1,interest_rate,term_number),term_number)),
interest_payment = 0,
principal_payment = 0,
start_balance = 0,
end_balance = 1)
for(i in (1:term_number+1)){
amortization[i,'start_balance'] = amortization[i-1,'end_balance']*(1+interest_rate/12)
amortization[i,'interest_payment'] = interest_rate*amortization[i-1,'end_balance']/12
amortization[i,'principal_payment'] = amortization[i,'payment'] - amortization[i,'interest_payment']
amortization[i,'end_balance'] = amortization[i,'start_balance']-amortization[i,'payment']
}
amortization[term_number+1,'end_balance']=0
return(amortization)
}
#Helper function to round numeric values when writing to data table
round_df <- function(df, digits) {
nums <- vapply(df, is.numeric, FUN.VALUE = logical(1))
df[,nums] <- round(df[,nums], digits = digits) (df) } #Printing values get_amortization(36,0.078) %>%
round_df(4) %>%
kable(format = "markdown")

statement

payment

interest_payment

principal_payment

start_balance

end_balance

0

0.0000

0.0000

0.0000

0.0000

1.0000

1

0.0312

0.0065

0.0247

1.0065

0.9753

2

0.0312

0.0063

0.0249

0.9816

0.9504

3

0.0312

0.0062

0.0251

0.9565

0.9253

4

0.0312

0.0060

0.0252

0.9313

0.9001

5

0.0312

0.0059

0.0254

0.9059

0.8747

6

0.0312

0.0057

0.0256

0.8803

0.8491

7

0.0312

0.0055

0.0257

0.8546

0.8234

8

0.0312

0.0054

0.0259

0.8287

0.7975

9

0.0312

0.0052

0.0261

0.8027

0.7714

10

0.0312

0.0050

0.0262

0.7764

0.7452

11

0.0312

0.0048

0.0264

0.7500

0.7188

12

0.0312

0.0047

0.0266

0.7235

0.6922

13

0.0312

0.0045

0.0267

0.6967

0.6655

14

0.0312

0.0043

0.0269

0.6698

0.6386

15

0.0312

0.0042

0.0271

0.6427

0.6115

16

0.0312

0.0040

0.0273

0.6154

0.5842

17

0.0312

0.0038

0.0274

0.5880

0.5567

18

0.0312

0.0036

0.0276

0.5604

0.5291

19

0.0312

0.0034

0.0278

0.5326

0.5013

20

0.0312

0.0033

0.0280

0.5046

0.4733

21

0.0312

0.0031

0.0282

0.4764

0.4452

22

0.0312

0.0029

0.0284

0.4481

0.4168

23

0.0312

0.0027

0.0285

0.4195

0.3883

24

0.0312

0.0025

0.0287

0.3908

0.3596

25

0.0312

0.0023

0.0289

0.3619

0.3307

26

0.0312

0.0021

0.0291

0.3328

0.3016

27

0.0312

0.0020

0.0293

0.3035

0.2723

28

0.0312

0.0018

0.0295

0.2740

0.2428

29

0.0312

0.0016

0.0297

0.2444

0.2131

30

0.0312

0.0014

0.0299

0.2145

0.1833

31

0.0312

0.0012

0.0301

0.1845

0.1532

32

0.0312

0.0010

0.0302

0.1542

0.1230

33

0.0312

0.0008

0.0304

0.1238

0.0925

34

0.0312

0.0006

0.0306

0.0931

0.0619

35

0.0312

0.0004

0.0308

0.0623

0.0310

36

0.0312

0.0002

0.0310

0.0312

0.0000

Results: Paid As Agreed

I am now going to use the Lending Club data to calculate actual yield of grade A, 36 month loans and compare it to the expectations generated by the above payment schedule. I will do this calculation by year for each year of mature loans in Lending Club. We are assuming there is no prepayment and charge off behavior, so I will also calculate these to see scale of these behaviors.

lcA %>%
# Perform the following calculation for each year
group_by(issue_year) %>%
# Aggregrate for each year
summarise(
#Number of Loans
Count = n(),
#Dollars Loaned Out
DollarsFunded = sum(funded_amnt),
#Dollars collected from payments
DollarsRecieved = sum(total_rec),
#Percent Charge Offs
Chargeoffs = round(sum(ifelse(grepl("Charged Off",loan_status),1,0))/Count*100,1),
#Percent Prepayments
Prepays = round(sum(ifelse(grepl("Fully Paid",loan_status)&(Age < 36),1,0))/Count*100,1), #The Percent Increase on the Dollars Loans Yield = round(100*(DollarsRecieved/DollarsFunded-1),1), #Dollar weighted interest rate for each year Interest = round(weighted.mean(interest,funded_amnt),1)) %>%
mutate(
#Applys the amortization schedule and sumes payments recieved for each loan
ExpectedYield = apply( (.),
1,
function(x) sum(get_amortization(36,
as.numeric(x['Interest'])/100)$payment))) %>%
mutate(
#Rounds Expected Yield to a percent
ExpectedYield = round(100*(ExpectedYield-1),1)) %>%
#Filter to relevant variables
select(issue_year,DollarsFunded,Prepays,Chargeoffs,Interest,Yield,ExpectedYield) %>%
#Convert to % for display
mutate(Prepays=paste0(Prepays,"%"),
Chargeoffs=paste0(Chargeoffs,"%"),
Interest=paste0(Interest,"%"),
Yield=paste0(Yield,"%"),
ExpectedYield = paste0(ExpectedYield,"%")) %>%
#Create a data table
kable( format = "markdown")

issue_year

DollarsFunded

Prepays

Chargeoffs

Interest

Yield

ExpectedYield

2008

1899800

49.4%

6%

8.5%

7.6%

13.6%

2009

8700550

49.7%

6.7%

8.8%

7.6%

14.1%

2010

20650050

53.1%

4.4%

7.2%

7.4%

11.5%

2011

49784200

50.9%

6.4%

7.2%

6.4%

11.5%

2012

119545400

52.4%

7.2%

7.7%

6.4%

12.3%

Review: Paid As Agreed

I can see that there are signficant Prepayments and Charge Offs for each year of Lending Club, so I’ll try to include these in the expectations.

Cash Flow Expectation : Account For Charge Offs

I am going to include charge offs in the payment expectations. Once a loan charges off, it no longer makes payments. This makes it straight forward to include in the amortatization schedule because we only need to keep track of how many loans are left, and add the payments of the left over loans. In reality there is also deliquency behavior that should be accounted for, but we will have to do that another time.

To execute this, I need to get get a feel for when the loans charge off in a given traunch of loans.

# Charege Offs
lcA %>%
#Perform the following calculations for each year
group_by(issue_year) %>%
## Count the number of leans each year
mutate(count=n()) %>%
## Perform a summary for each year and age of loan
group_by(issue_year,Age) %>%
## Find the number of loans that charge off each month
summarise(bad = sum(ifelse(grepl("Charged Off",loan_status),1,0)),
#Get the count value for the year
#count is a vector of the save value for each group
count = max(count)) %>%
## ORder by the age for the cumsum
arrange(Age) %>%
## Get the cumlative total of bad loans over the course of the term
mutate(total_bad = cumsum(bad)) %>%
## Group by the traunch of loans
group_by(issue_year) %>%
##Get the total number of bad loans for the traunch
mutate(max_total_bad = max(total_bad)) %>%
##Plot the cumlative percentage of chargeoffs for each traunch
ggplot(aes(Age,total_bad/max_total_bad,color=factor(issue_year)))+
geom_line()+
theme_bw()+
xlim(0,36)+
ylab('Percent Total Chargeoffs')

The cumulative charge off behavior has a characteristic shape, so we can use this shape to include in the expected number of units making payments in a given month. We can generate this expectations by just doing a weighted average on these curves (thought there are more thoughtful ways to do this as well). I’m going to take this average, and smoothing fit, and make a function of out it to include in the amortization of the loan.

percentChargeoffFunc <- lcA %>%
#Perform the following calculations for each year
group_by(issue_year) %>%
## Count the number of leans each year
mutate(count=n()) %>%
## Perform a summary for each year and age of loan
group_by(issue_year,Age) %>%
## Find the number of loans that charge off each month
summarise(bad = sum(ifelse(grepl("Charged Off",loan_status),1,0)),
#Get the count value for the year
#count is a vector of the save value for each group
count = max(count)) %>%
## ORder by the age for the cumsum
arrange(Age) %>%
## Limit to 36 months
filter(Age <= 36) %>%
## Get the cumlative total of bad loans over the course of the term
mutate(total_bad = cumsum(bad)) %>%
## Group by the traunch of loans
group_by(issue_year) %>%
##Get the total number of bad loans for the traunch
mutate(max_total_bad = max(total_bad)) %>%
##Calculate percent bad
mutate(percent_bad = total_bad/max_total_bad) %>%
## Group by Age to do the weighted average of all the curves
group_by(Age) %>%
## Calculate the weighted average by the traunch size
summarise(avg_percent_total_bad = weighted.mean(percent_bad,count)) %>%
## Return Approx Function to describe Chargeoff Behavior
(function(df) {
## Make a loess fit of the average to smooth out
mod <- loess(avg_percent_total_bad ~ Age,df,span=1/4)
## make the predictions
pred <- predict(mod,df)
## create a data frame of the age and smoothed predictions
tmp <- data.frame(Age = df$Age,pred_percent_total_bad=pred) ## make sure the new smooth values are between 0 and 1 tmp$pred_percent_total_bad = tmp$pred_percent_total_bad-min(tmp$pred_percent_total_bad) tmp$pred_percent_total_bad = tmp$pred_percent_total_bad/max(tmp$pred_percent_total_bad) # return approx function of the smooth function approxfun(tmp$Age,tmp$pred_percent_total_bad,method='linear',yleft = 0,yright=1) }) ## Plot Function Output data.frame(Age=seq(0,36), pco = percentChargeoffFunc(seq(0,36))) %>%
ggplot(aes(Age,pco))+
geom_line()+
theme_bw()+
xlim(0,36)+
ylab('Percent Total Chargeoffs')

Now that I have the function, I can include the expected charge off function times the actual scale to update our cash flow expectations.

#Ammortizaiton Schedule for loans with Charge Off Expectations
get_amortization_chargeoff <- function(term_number,interest_rate,percent_chargeoff){
amortization <- data.frame(statement=seq(0,term_number,1),
payment=c(0,rep(payment_value(1,interest_rate,term_number),term_number)),
interest_payment = 0,
principal_payment=0,
start_balance=0,
end_balance=1,
#Added a units chargeoff which is the new curve times the measured scale
unit_chargeoff = percent_chargeoff*percentChargeoffFunc(seq(0,36)))
for(i in (1:term_number+1)){
amortization[i,'start_balance'] = amortization[i-1,'end_balance']*(1+interest_rate/12)
amortization[i,'interest_payment'] = interest_rate*amortization[i-1,'end_balance']/12
amortization[i,'principal_payment'] = amortization[i,'payment'] - amortization[i,'interest_payment']
amortization[i,'end_balance'] = amortization[i,'start_balance']-amortization[i,'payment']
}
amortization[term_number+1,'end_balance']=0
#Adds a payment recieved which is the schedule payments times the number of units left on a given statement
amortization <- amortization %>% mutate(payment_recieved = (1-unit_chargeoff)*payment)
return(amortization)
}
#Print output
get_amortization_chargeoff(36,0.07,0.06) %>%
round_df(4) %>%
kable( format = "markdown")

statement

payment

interest_payment

principal_payment

start_balance

end_balance

unit_chargeoff

payment_recieved

0

0.0000

0.0000

0.0000

0.0000

1.0000

0.0000

0.0000

1

0.0309

0.0058

0.0250

1.0058

0.9750

0.0000

0.0309

2

0.0309

0.0057

0.0252

0.9806

0.9498

0.0008

0.0309

3

0.0309

0.0055

0.0253

0.9553

0.9244

0.0019

0.0308

4

0.0309

0.0054

0.0255

0.9298

0.8989

0.0033

0.0308

5

0.0309

0.0052

0.0256

0.9042

0.8733

0.0051

0.0307

6

0.0309

0.0051

0.0258

0.8784

0.8475

0.0070

0.0307

7

0.0309

0.0049

0.0259

0.8525

0.8216

0.0088

0.0306

8

0.0309

0.0048

0.0261

0.8264

0.7955

0.0107

0.0305

9

0.0309

0.0046

0.0262

0.8002

0.7693

0.0130

0.0305

10

0.0309

0.0045

0.0264

0.7738

0.7429

0.0152

0.0304

11

0.0309

0.0043

0.0265

0.7472

0.7163

0.0172

0.0303

12

0.0309

0.0042

0.0267

0.7205

0.6896

0.0191

0.0303

13

0.0309

0.0040

0.0269

0.6937

0.6628

0.0212

0.0302

14

0.0309

0.0039

0.0270

0.6667

0.6358

0.0235

0.0302

15

0.0309

0.0037

0.0272

0.6395

0.6086

0.0258

0.0301

16

0.0309

0.0036

0.0273

0.6122

0.5813

0.0282

0.0300

17

0.0309

0.0034

0.0275

0.5847

0.5538

0.0305

0.0299

18

0.0309

0.0032

0.0276

0.5570

0.5261

0.0327

0.0299

19

0.0309

0.0031

0.0278

0.5292

0.4983

0.0345

0.0298

20

0.0309

0.0029

0.0280

0.5012

0.4704

0.0363

0.0298

21

0.0309

0.0027

0.0281

0.4731

0.4422

0.0383

0.0297

22

0.0309

0.0026

0.0283

0.4448

0.4139

0.0399

0.0296

23

0.0309

0.0024

0.0285

0.4164

0.3855

0.0414

0.0296

24

0.0309

0.0022

0.0286

0.3877

0.3569

0.0429

0.0296

25

0.0309

0.0021

0.0288

0.3589

0.3281

0.0447

0.0295

26

0.0309

0.0019

0.0290

0.3300

0.2991

0.0469

0.0294

27

0.0309

0.0017

0.0291

0.3008

0.2700

0.0490

0.0294

28

0.0309

0.0016

0.0293

0.2715

0.2407

0.0510

0.0293

29

0.0309

0.0014

0.0295

0.2421

0.2112

0.0528

0.0292

30

0.0309

0.0012

0.0296

0.2124

0.1815

0.0543

0.0292

31

0.0309

0.0011

0.0298

0.1826

0.1517

0.0556

0.0292

32

0.0309

0.0009

0.0300

0.1526

0.1217

0.0567

0.0291

33

0.0309

0.0007

0.0302

0.1224

0.0916

0.0577

0.0291

34

0.0309

0.0005

0.0303

0.0921

0.0612

0.0586

0.0291

35

0.0309

0.0004

0.0305

0.0616

0.0307

0.0594

0.0290

36

0.0309

0.0002

0.0307

0.0309

0.0000

0.0600

0.0290

Results: Account For Charge Offs

lcA %>%
# Perform the following calculation for each year
group_by(issue_year) %>%
# Aggregrate for each year
summarise(
#Number of Loans
Count = n(),
#Dollars Loaned Out
DollarsFunded = sum(funded_amnt),
#Dollars collected from payments
DollarsRecieved = sum(total_rec),
#Percent Charge Offs
Chargeoffs = round(sum(ifelse(grepl("Charged Off",loan_status),1,0))/Count*100,1),
#Percent Prepayments
Prepays = round(sum(ifelse(grepl("Fully Paid",loan_status)&(Age < 36),1,0))/Count*100,1), #The Percent Increase on the Dollars Loans Yield = round(100*(DollarsRecieved/DollarsFunded-1),1), #Dollar weighted interest rate for each year Interest = round(weighted.mean(interest,funded_amnt),1)) %>%
mutate(
#Applys the amortization schedule and sumes payments recieved for each loan
ExpectedYield = apply( (.),
1,
function(x) sum(get_amortization_chargeoff(36,
as.numeric(x['Interest'])/100,
as.numeric(x['Chargeoffs'])/100
)$payment_recieved))) %>%
mutate(
#Rounds Expected Yield to a percent
ExpectedYield = round(100*(ExpectedYield-1),1)
) %>%
#Filter to relevant variables
select(issue_year,DollarsFunded,Prepays,Chargeoffs,Interest,Yield,ExpectedYield) %>%
#Convert to % for display
mutate(Prepays=paste0(Prepays,"%"),
Chargeoffs=paste0(Chargeoffs,"%"),
Interest=paste0(Interest,"%"),
Yield=paste0(Yield,"%"),
ExpectedYield = paste0(ExpectedYield,"%")) %>%
#Create a data table
kable( format = "markdown")

issue_year

DollarsFunded

Prepays

Chargeoffs

Interest

Yield

ExpectedYield

2008

1899800

49.4%

6%

8.5%

7.6%

10%

2009

8700550

49.7%

6.7%

8.8%

7.6%

10.1%

2010

20650050

53.1%

4.4%

7.2%

7.4%

8.9%

2011

49784200

50.9%

6.4%

7.2%

6.4%

7.7%

2012

119545400

52.4%

7.2%

7.7%

6.4%

8%

Review: Account For Charge Offs

I see that the expected values have decreased from the previous expectations. The expected and actuals are closer in line, but there is room for improvement. I have accounted for the charge offs, but not the prepayment behavior.

Cash Flow Expectation : Account For Prepayments & Charge Offs

I can now repeat the process I did for including the effect of charge offs in our cash flow expectations. I can see if there is a characteristic cumulative behavior, generate a function that approximates the behavior, and incorporate it into the amortization.

# Prepays
lcA %>%
## Perform the following calculations for each year
group_by(issue_year) %>%
## Count the number of leans each year
mutate(count=n()) %>%
## Perform a summary for each year and age of loan
group_by(issue_year,Age) %>%
## Find the number of loans that prepayments each month
summarise(prepay = sum(ifelse(grepl("Fully Paid",loan_status),1,0)),
## Get the count value for the year
## count is a vector of the save value for each group
count = max(count)) %>%
## ORder by the age for the cumsum
arrange(Age) %>%
## Can only prepay before the loan is mature
filter(Age < 36) %>%
## Get the cumlative total of prepaid loans over the course of the term
mutate(total_prepay = cumsum(prepay)) %>%
## Group by the traunch of loans
group_by(issue_year) %>%
## Get the total number of prepayments for the traunch
mutate(max_total_prepay = max(total_prepay)) %>%
## Plot the cumlative percentage of prepaid loans for each traunch
ggplot(aes(Age,total_prepay/max_total_prepay,color=factor(issue_year)))+
geom_line()+
theme_bw()+
xlim(0,36)+
ylab('Percent Total Prepayment')

The cumulative prepayment behavior over the life does have a characteristic shape, so I will generate an approximation function similar to charge offs.

percentPrepaymentFunc <- lcA %>%
## Perform the following calculations for each year
group_by(issue_year) %>%
## Count the number of leans each year
mutate(count=n()) %>%
## Perform a summary for each year and age of loan
group_by(issue_year,Age) %>%
## Find the number of loans that prepayments each month
summarise(prepay = sum(ifelse(grepl("Fully Paid",loan_status),1,0)),
## Get the count value for the year
## count is a vector of the save value for each group
count = max(count)) %>%
## ORder by the age for the cumsum
arrange(Age) %>%
## Can only prepay before the loan is mature
filter(Age < 36) %>%
## Get the cumlative total of prepaid loans over the course of the term
mutate(total_prepay = cumsum(prepay)) %>%
## Group by the traunch of loans
group_by(issue_year) %>%
## Get the total number of prepayments for the traunch
mutate(max_total_prepay = max(total_prepay)) %>%
## Calculate the percentage of prepayments that happen in a given
## month out of all of the prepayments that happen in the traunch
mutate(percent_prepay = total_prepay/max_total_prepay) %>%
## Calculate for eeach age
group_by(Age) %>%
## The weighted mean of the percentage by the size of the traunch
summarise(avg_percent_total_prepay = weighted.mean(percent_prepay,count)) %>%
## Return an approx function
(function(df) {
## perform a loess fit to smoooth the function
mod <- loess(avg_percent_total_prepay ~ Age,df,span=1/4)
## predict the results
pred <- predict(mod,df)
## create a data frame to use to make the approx function
tmp <- data.frame(Age = df$Age,pred_percent_total_prepay=pred) ## Make sure the fit produces values between 0 and 1 tmp$pred_percent_total_prepay = tmp$pred_percent_total_prepay-min(tmp$pred_percent_total_prepay) tmp$pred_percent_total_prepay = tmp$pred_percent_total_prepay/max(tmp$pred_percent_total_prepay) ## Return the approx function approxfun(tmp$Age,tmp$pred_percent_total_prepay,method='linear',yleft = 0,yright=1) }) ## Plot the approx functions data.frame(Age=seq(0,36), ppp = percentPrepaymentFunc(seq(0,36))) %>%
ggplot(aes(Age,ppp))+
geom_line()+
theme_bw()+
xlim(0,36)+
ylab('Percent Total Chargeoffs')

Results: Account For Prepayments & Charge Offs

I have the characteristic prepayment function and charge off function, so I can append it to the amortization schedule. Like loans that charge off, a prepaid loan does not generate payments after it is prepaid. Unlike charged off loans, it does return the principal. Once this is included, I suspect I will see higher than previously expected cashflows earlier due to returned principal.

#Ammortizaiton Schedule for loans with Charge Off and Prepayment Expectations
get_amortization_chargeoff_prepay <- function(term_number,interest_rate,percent_chargeoff,percent_prepay){
amortization <- data.frame(statement=seq(0,term_number,1),
payment=c(0,rep(payment_value(1,interest_rate,term_number),term_number)),
interest_payment = 0,
principal_payment=0,
start_balance=0,
end_balance=1,
#Units chargeoffed is measures scale * expected percent of total
unit_chargeoff = percent_chargeoff*percentChargeoffFunc(seq(0,36)),
#Units prepyament is measured scale * expected percent of total
unit_prepay = percent_prepay*percentPrepaymentFunc(seq(0,36)))
for(i in (1:term_number+1)){
amortization[i,'start_balance'] = amortization[i-1,'end_balance']*(1+interest_rate/12)
amortization[i,'interest_payment'] = interest_rate*amortization[i-1,'end_balance']/12
amortization[i,'principal_payment'] = amortization[i,'payment'] - amortization[i,'interest_payment']
amortization[i,'end_balance'] = amortization[i,'start_balance']-amortization[i,'payment']
}
amortization[term_number+1,'end_balance']=0
amortization <- amortization %>%
mutate(monthly_prepay = ifelse(is.na(lag(unit_prepay)),0,unit_prepay-lag(unit_prepay))) %>%
mutate(payment_recieved = (1-unit_chargeoff-unit_prepay)*payment+monthly_prepay*(end_balance+payment))
return(amortization)
}
get_amortization_chargeoff_prepay(36,0.07,0.06,0.50) %>%
round_df(4) %>%
kable( format = "markdown")

statement

payment

interest_payment

principal_payment

start_balance

end_balance

unit_chargeoff

unit_prepay

monthly_prepay

payment_recieved

0

0.0000

0.0000

0.0000

0.0000

1.0000

0.0000

0.0000

0.0000

0.0000

1

0.0309

0.0058

0.0250

1.0058

0.9750

0.0000

0.0000

0.0000

0.0309

2

0.0309

0.0057

0.0252

0.9806

0.9498

0.0008

0.0064

0.0064

0.0369

3

0.0309

0.0055

0.0253

0.9553

0.9244

0.0019

0.0135

0.0071

0.0372

4

0.0309

0.0054

0.0255

0.9298

0.8989

0.0033

0.0214

0.0079

0.0375

5

0.0309

0.0052

0.0256

0.9042

0.8733

0.0051

0.0298

0.0084

0.0374

6

0.0309

0.0051

0.0258

0.8784

0.8475

0.0070

0.0390

0.0093

0.0376

7

0.0309

0.0049

0.0259

0.8525

0.8216

0.0088

0.0492

0.0101

0.0377

8

0.0309

0.0048

0.0261

0.8264

0.7955

0.0107

0.0600

0.0108

0.0376

9

0.0309

0.0046

0.0262

0.8002

0.7693

0.0130

0.0718

0.0118

0.0377

10

0.0309

0.0045

0.0264

0.7738

0.7429

0.0152

0.0825

0.0107

0.0361

11

0.0309

0.0043

0.0265

0.7472

0.7163

0.0172

0.0928

0.0103

0.0352

12

0.0309

0.0042

0.0267

0.7205

0.6896

0.0191

0.1039

0.0111

0.0351

13

0.0309

0.0040

0.0269

0.6937

0.6628

0.0212

0.1168

0.0129

0.0356

14

0.0309

0.0039

0.0270

0.6667

0.6358

0.0235

0.1321

0.0153

0.0363

15

0.0309

0.0037

0.0272

0.6395

0.6086

0.0258

0.1474

0.0153

0.0353

16

0.0309

0.0036

0.0273

0.6122

0.5813

0.0282

0.1631

0.0157

0.0346

17

0.0309

0.0034

0.0275

0.5847

0.5538

0.0305

0.1788

0.0157

0.0336

18

0.0309

0.0032

0.0276

0.5570

0.5261

0.0327

0.1942

0.0154

0.0324

19

0.0309

0.0031

0.0278

0.5292

0.4983

0.0345

0.2096

0.0154

0.0315

20

0.0309

0.0029

0.0280

0.5012

0.4704

0.0363

0.2256

0.0160

0.0308

21

0.0309

0.0027

0.0281

0.4731

0.4422

0.0383

0.2432

0.0176

0.0305

22

0.0309

0.0026

0.0283

0.4448

0.4139

0.0399

0.2599

0.0167

0.0290

23

0.0309

0.0024

0.0285

0.4164

0.3855

0.0414

0.2759

0.0160

0.0278

24

0.0309

0.0022

0.0286

0.3877

0.3569

0.0429

0.2923

0.0163

0.0269

25

0.0309

0.0021

0.0288

0.3589

0.3281

0.0447

0.3098

0.0176

0.0262

26

0.0309

0.0019

0.0290

0.3300

0.2991

0.0469

0.3292

0.0194

0.0257

27

0.0309

0.0017

0.0291

0.3008

0.2700

0.0490

0.3486

0.0193

0.0244

28

0.0309

0.0016

0.0293

0.2715

0.2407

0.0510

0.3681

0.0196

0.0232

29

0.0309

0.0014

0.0295

0.2421

0.2112

0.0528

0.3872

0.0191

0.0219

30

0.0309

0.0012

0.0296

0.2124

0.1815

0.0543

0.4062

0.0189

0.0207

31

0.0309

0.0011

0.0298

0.1826

0.1517

0.0556

0.4249

0.0187

0.0195

32

0.0309

0.0009

0.0300

0.1526

0.1217

0.0567

0.4438

0.0190

0.0183

33

0.0309

0.0007

0.0302

0.1224

0.0916

0.0577

0.4632

0.0194

0.0172

34

0.0309

0.0005

0.0303

0.0921

0.0612

0.0586

0.4819

0.0187

0.0159

35

0.0309

0.0004

0.0305

0.0616

0.0307

0.0594

0.5000

0.0181

0.0147

36

0.0309

0.0002

0.0307

0.0309

0.0000

0.0600

0.5000

0.0000

0.0136

Now I can add the prepayment behavior into our cashflows and compare the expected to the actuals.

lcA %>%
# Perform the following calculation for each year
group_by(issue_year) %>%
# Aggregrate for each year
summarise(
#Number of Loans
Count = n(),
#Dollars Loaned Out
DollarsFunded = sum(funded_amnt),
#Dollars collected from payments
DollarsRecieved = sum(total_rec),
#Percent Charge Offs
Chargeoffs = round(sum(ifelse(grepl("Charged Off",loan_status),1,0))/Count*100,1),
#Percent Prepayments
Prepays = round(sum(ifelse(grepl("Fully Paid",loan_status)&(Age < 36),1,0))/Count*100,1), #The Percent Increase on the Dollars Loans Yield = round(100*(DollarsRecieved/DollarsFunded-1),1), #Dollar weighted interest rate for each year Interest = round(weighted.mean(interest,funded_amnt),1)) %>%
mutate(
#Applys the amortization schedule and sumes payments recieved for each loan
ExpectedYield = apply( (.),
1,
function(x) sum(get_amortization_chargeoff_prepay(36,
as.numeric(x['Interest'])/100,
as.numeric(x['Chargeoffs'])/100,
as.numeric(x['Prepays'])/100
)$payment_recieved))) %>%
mutate(
#Rounds Expected Yield to a percent
ExpectedYield = round(100*(ExpectedYield-1),1)
) %>%
#Filter to relevant variables
select(issue_year,DollarsFunded,Prepays,Chargeoffs,Interest,Yield,ExpectedYield) %>%
#Convert to % for display
mutate(Prepays=paste0(Prepays,"%"),
Chargeoffs=paste0(Chargeoffs,"%"),
Interest=paste0(Interest,"%"),
Yield=paste0(Yield,"%"),
ExpectedYield = paste0(ExpectedYield,"%")) %>%
#Create a data table
kable(format = "markdown")

issue_year

DollarsFunded

Prepays

Chargeoffs

Interest

Yield

ExpectedYield

2008

1899800

49.4%

6%

8.5%

7.6%

8.4%

2009

8700550

49.7%

6.7%

8.8%

7.6%

8.3%

2010

20650050

53.1%

4.4%

7.2%

7.4%

7.4%

2011

49784200

50.9%

6.4%

7.2%

6.4%

6.3%

2012

119545400

52.4%

7.2%

7.7%

6.4%

6.4%

Review: Account For Prepayments & Charge Offs

Now I have a framework for cashflow expectations that comes within error of the actuals. This method requires that I generate exceptions for the expected charge off and prepayment scales to forecast cashflows, but once I have that, the rest is just straight forward math. The cashflow model that comes describes lending club actuals is using the paid as agreed model, then account for charge off and prepayment behavior. Even though there are a number of simplification and assumptions in this framework, it is an improved understanding of the expected cashflows for a grade A, 36 month term loan in Lending Club.

I have been obsessed with online machine learning since I first read Richard Sutton’s Learning to Predict by the Methods of Temporal Difference. The idea of being able to update a model with each new bit of information felt closer to real cognition than traditional supervised methods.

I have also been reading more and more about real-time analytics, and their facilitation by distributed messaging systems like Apache Kafka. Here is a list of companies that using Kafka.

I wanted to spend today getting Kafka running locally, publishing and consuming using python, and run an online model on consumed data from Kafka that was published from a python simulation.

My work flow for today is:

Download, Configure, and Run Kafka (Mac/Linux)

Start Zookeeper

Start two servers

Create Kafka Topics

Write a Simple Python Producer/Consumer

Produce Some Simulation Data

Fit a mod

Setting up Kafka

I downloaded the binary file from http://kafka.apache.org/downloads.html. I used scala 2.11, but I do not believe it matters unless you plan on using scala to interface with Kafka. If and when I do this with Spark, I will given more detailed instructions. Since we’re doing this with python we can use either.

After I down load it I decompress the folder in my home directory and move into the directory.

We are going to be working with the bin and config files. First we need to set up config for our servers so I can have two servers. Because, why not. First we want to make edit the first file and create a log directory.

I have ‘mkdir kafka-log-1’ in my home directory

Now we want to make a copy of the server.properties file than we want to edit the copy.

We are changing the broker id, and the port number

I have also ran ‘mkdir kafka-log-2’ in my home directory.

These three processes are started in the background, but you will see a lot of log messages and warnings display. I’m just going to ignore and move one since they are not blockers..

Creating Topics

Now that Kafka is running locally, we are going to create some topics. One will be for our simple-test. The other for our online-learning.

I am not an expert, and am using kafka-python. It is easily installed using pip.

pip install kafka-python

I want to start with a simple consumer. It is going to iterate through all messages in the topic ‘simple-test’ and print them out. It will wait until I kill the process. FYI: Kafka-python requires that I point them to the Kafka serve addresses:

#simple_consumer.py
from kafka import KafkaConsumer
consumer = KafkaConsumer('online',
group_id='test',
bootstrap_servers="localhost:9091,localhost:9092")
for msg in consumer:
print msg.value

You can run this a the command line:

python simple_consumer.py

Now I want to make a simple producer that adds a message every second to the topic ‘simple-test’.

#simple_producer.py
from kafka import KafkaProducer
from time import sleep
producer = KafkaProducer(
bootstrap_servers='localhost:9091,localhost:9092',
client_id='simple-producer')
for i in range(120):
msg = "Message {}".format(i)
producer.send('simple-test',msg)
sleep(1)

You can run this a the command line at a separate terminal:

python simple_producer.py

When both are running you should see something like this.

You will notice that order is not ensured!

Online-Learning

I am going to run the sklearn’s SGD classifier using its partial_fit functionality. For simplicity, I am going to use this example of SGD but split the data generation and fitting by Kafka, and using partial_fit instead of fit.

I am going to transform all the data into JSON for transmission through Kafka, and then the consumer will have to parse the JSON to update the model.

#data_producer.py
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
import json
from kafka import KafkaProducer
from time import sleep
producer = KafkaProducer(
bootstrap_servers='localhost:9091,localhost:9092',
client_id='online-producer')
X, Y = make_blobs(n_samples=1200,n_features=2,centers=2, cluster_std=3.0)
temp = np.zeros((X.shape[0],X.shape[1]+1))
temp[:,:X.shape[1]]=X
temp[:,X.shape[1]]=Y
for row in temp:
post = json.dumps(dict([(i,x) for i,x in enumerate(row.tolist())]))
producer.send('online',post)
sleep(0.1)

You can start this script with:

python data_producer.py

For the consumer, we will be updating a model and just print out its running accuracy. In theory we can do a lot more, but I’m just playing. Best to keep it simple.

#online_consumer.py
import numpy as np
from sklearn.linear_model import SGDClassifier
from kafka import KafkaConsumer
import json
from sklearn.metrics import roc_auc_score
n_features = 2
clf = SGDClassifier(loss="hinge", alpha=0.01, n_iter=20, fit_intercept=True)
consumer = KafkaConsumer('online',
group_id='online-consumer',
bootstrap_servers="localhost:9091,localhost:9092")
count = 0
correct = 0
for msg in consumer:
data = json.loads(msg.value)
vec = np.zeros(len(data))
for key,value in data.iteritems():
vec[int(key)]=value
features = vec[:n_features].reshape(1,n_features)
target = [vec[n_features]]
if count > 1:
prediction = clf.predict(features)
correct += int(prediction[0]==int(target[0]))
if count > 2:
print correct/float(count-1)
clf.partial_fit(features, target,classes=[0,1])
count += 1

You can start the consumer in one terminal:

python online_consumer.py

And the producer in the other:

python data_producer.py

Clean Up

You need to kill all your processes. I usually ctrl-c the python scripts.

To stop Kafka and Zookeeper you can run the shutdown scripts:

I just wanted to play this morning, and decided to make it a blog post. If this is new to you, I hope you learned something. If you are a master wizard, I would love to get any advice on doing this better. Either way, thanks for reading this far.

I have been dealing with projects that are using datasets that don’t fit in memory, or even on my laptop. I have also been taking a class where a number of classmates reached out for help setting up an ‘Big Data’ system on AWS. The class is having us use Pig, Hive, & Spark to analyze healthcare records and apply machine learning and clustering methods to the data.

I found myself repeating myself on the setup process, so I decided to write a post outlining how I have been working with setting up a system that has Hive, Pig, Spark, Zeppelin installed on AWS.

Zeppelin is a relatively new notebook that works with scala, sql, and more!

Step 0 – Sign up For AWS

I feel like this goes without saying, but experience taught me otherwise. https://aws.amazon.com/

Step 1 – Launch EMR

The first step you need to do is log into your AWS console. Once in you can click on the EMR icon/link.

That will bring you to a page where you can create your own new and shiny cluster.

To install all of the software we want on our cluster we are going to have to go to the advance options.

AWS now supports all the standard big data software out of the box. Just check what you wanted installed. In this case I am make sure I have hadoop 2.7, spark 1.6, hive 1.0, pig 0.14, and Zeppelin 0.5.5.

Click next and you will be taken to the Hardware tab. I always add storage to my EMR clusters. The rule of thumb I use is I tripple the size of the data I am going to load into the cluster. I also make sure I have enough space on the master node to store my data before I load it into the cluster.

I also choose my instance states at this point. For this cluster I am choose r3.xlarge because of the increase memory. Each node is currently $0.09/hour – so this cluster I am make is $0.45/hour. A relatively modest cost to analyze 100Gb of data. You can find the EMR pricing here.

You can click next and give your cluster an identity. This is also a spot where you can choose bootstrap actions that will allow you to include additional software. I will skip this for this post.

Now that you are done with all the steps you create the cluster and wait!

Step 2 – Edit Security Groups

While you are waiting for the cluster to load you are going to want to make sure you have access to the master node. You are going to need to ssh in (port 22), access to Zeppelin (port 8890). From the cluster details window you can click on “security group for Master”

You will want to change the inbound settings so you can contact the server from your IP address.

Click Edit:

Add ssh or custom TCP rules if they do not exist. I recommend you choose the tab (“My IP”) to limit access from your location. Bots to scan AWS ports and can/will connect to open connections. Open up your 22, and 8890 ports.

Now wait until your cluster is ready.

Step 3 – SSH and Load Data

Once your cluster is read and waiting for you, you can ssh into your master node to load your data into the hadoop file system. I keep my data in private s3 buckets, so I will be taking advantage of the AWS Command Line Tool that comes installed with EMR.

This is where you will need your keypair to ssh into your master node. I will show the mac/linux method. Windows users will have to look up the Putty instructions.

FIRST TIME USERS: be sure to chmod 400 <keypair>.pem before you try to connect to your master node.

Now that you are connected to the master node you can copy data into hdfs. I am first going to copy my class data from an S3 bucket to the local master node.

I want to access the data from Zeppelin, and EMR sets up an Zeppelin user. To access the data from the zeppelin notebook you have load it into the right HDFS directory .

The data is now loaded into your hadoop file system in the cluster. It is also accessible from zeppelin.

Step 3 – Use Zeppelin

Now that you have your data loaded you can now take advantage of spark and zeppelin. Open a browser and type in the web-address of your master-node but connect on port 8890. Create a new notebook.

There is a spark context automatically loaded. You can type ‘sc’ to see it. You can also load your data that is in hdfs in the /user/zeppelin location. What is nice about Zeppelin is that you can shell commands, mark down, scala, and python at your finger tips.

Step 4 – Save your Notebooks.

Data is not saved in the EMR cluster, so you need to save out before you terminate your cluster. I always make a new s3 bucket then copy the local notebooks to the s3 bucket.

Now that you have done your big data analysis with your fancy new cluster, you need to terminate your cluster to avoid accruing costs. From your cluster dashboard you can hit terminate.

I always have termination protection on, so I have to turn it off before I terminate.

After a few minutes the cluster will be terminate and your done.

Thank You

I appreciate you taking the time to get this far into my post. Let me know if it was helpful.

While I was reading it i had done of those forehead slap, “No Duh! I’ve been an idiot.” moments that forecast improvement in both execution and understanding when working with sparse datasets. Still very much a noob in data science!

In this post I wanted to just show the update rules for SGD that take advantage of sparse data structures with regularization.

Weight Update Functions

As I have gone through the derivation of the logistic cost function and weight updates with Ridge and LASSO regularization at length in previous posts, so I will just post the update rule here:

Ridge Weight Update

LASSO Weight Update

The is the weight for the ith feature, is the probability that the feature is a positive example, is the learning rate for the update, and is the regularization coefficient.

Stochastic Gradient Decent

Stochastic Gradient Decent (SGD) updates the cost function as each data point comes in, instead of in an iterative batch way. The summation in the above equations is going to go away, and each data point is going to individually update the weights. The new update rules follow.

Ridge SGD Weight Update

LASSO SGD Weight Update

I have rearranged the regularization term to the front of the equation to highlight how this changes when the data set is sparse. When there are several data points with a particular feature zero, the regularization is going to decrease the weight.

Ridge SGD Sparse Weight Update

While a particular feature is zero, its weight should be decaying to zero for each new observation. During this process the equations is the following.

If we have 300,000 features, we would need to perform 300,000 decays each round, even if only 10 or so features are non-zero. The above version highlights that we can lazily update the weights when we see a feature is nonzero as long as we keep track of how long it has been since it was updated.

LASSO SGD Sparse Weight Update

Similarly for LASSO regression we see the following update.

If then the weight is set to zero. This is because weight should not change sign from regularization, and is the clipping that is referred to in the paper.

SVM Light Format

To take advantage of the sparseness in the dataset we need to represent the data in a sparse format. The format of the data is

This row is a positive training example (leading 1) with the 10th feature having a non-zero value of 1.0, the 39th feature having a non-zero value of 1.0, the 431th feature has a value of 0.96, and so on.

This allows a significant compression of data when the data is sparse. This is a common format for text data, and you can download the Dexter Dataset as an example.

Code

I am going to compare two version of the same LogisticRegressionSGD class. One with lazy evaluation and one without lazy evaluation. The code is very similar between the two classes. I am only going to evaluate ridge regression in this post. LASSO extension is equally as straight forward.

Lazy Evaluation

class LogisticRegressionSGD:
def __init__(self, learning_rate, l2_coef):
&quot;&quot;&quot;
Initialization of model parameters
&quot;&quot;&quot;
self.learning_rate = learning_rate
self.weight = collections.defaultdict(float)
self.l2_coef = l2_coef
self.last_update = collections.defaultdict(int)
self.count = 1
def fit(self, X, y):
&quot;&quot;&quot;
Update model using a pair of training sample
&quot;&quot;&quot;
if self.l2_coef &amp;amp;amp;gt; 0:
for f, v in X:
self.weight[f] *= (1-self.l2_coef*self.learning_rate)**(self.count-self.last_update[f])
self.last_update[f] = self.count
diff = y - self.predict_prob(X)
for f, v in X:
self.weight[f] += diff*v*self.learning_rate
self.count += 1
def predict(self, X):
return 1 if self.predict_prob(X) &amp;amp;amp;gt; 0.5 else 0
def predict_prob(self, X):
return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))

Standard Evaluation

import collections
import math
class LogisticRegressionSGDSlow:
def __init__(self,learning_rate, l2_norm):
&quot;&quot;&quot;
Initialization of model parameters
&quot;&quot;&quot;
self.learning_rate = learning_rate
self.weight = collections.defaultdict(float) #[0.0] * n_feature
self.l2_norm = l2_norm
self.last_update = collections.defaultdict(int)
def fit(self, X, y):
&quot;&quot;&quot;
Update model using a pair of training sample
&quot;&quot;&quot;
diff = y - self.predict_prob(X)
if self.l2_norm &amp;amp;amp;gt; 0:
for f in self.weight:
self.weight[f] *= (1-self.l2_norm*self.learning_rate)
for f, v in X:
self.weight[f] += diff*v*self.learning_rate
def predict(self, X):
return 1 if self.predict_prob(X) &amp;amp;amp;gt; 0.5 else 0
def predict_prob(self, X):
return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))

The first class is keeping track of how many examples it has been fed, and using an additionally dictionary to track when the last update to that weight was. The second class iterates through each non-zero weight for each data point and ‘decaying’ it.

Command Line Training

To show and ensure the SGD component of the training, I am going to use the command line piping to load the data into the model. There is a python script to read in stdin and parse the svmlight data. It then trains the SGD logistic mode.

#!/usr/bin/env python
import sys
import random
import pickle
from optparse import OptionParser
from LogisticRegressionSGDSlow import LogisticRegressionSGDSlow
to_float_tuple = lambda x: (int(x[0]),float(x[1]))
if __name__ == '__main__':
parser = OptionParser()
parser.add_option(&quot;-l&quot;, &quot;--learning_rate&quot;, action=&quot;store&quot;,
dest=&quot;learning_rate&quot;, default=0.01, help=&quot;step size&quot;, type=&quot;float&quot;)
parser.add_option(&quot;-c&quot;, &quot;--l2_norm&quot;, action=&quot;store&quot;,
dest=&quot;l2_norm&quot;, default=0.0, help=&quot;regularization strength&quot;, type=&quot;float&quot;)
parser.add_option(&quot;-m&quot;, &quot;--model-path&quot;, action=&quot;store&quot;, dest=&quot;path&quot;,
default=&quot;logSGD.model&quot;, help=&quot;path where trained classifier was saved&quot;)
options, args = parser.parse_args(sys.argv)
model = LogisticRegressionSGDSlow(options.learning_rate, options.l2_norm)
for line in sys.stdin:
row = line.split(&quot; &quot;,1)
y = int(row[0])
features = row[1]
X = [ to_float_tuple(pair.split(&quot;:&quot;)) for pair in features.split() ]
model.fit(X,y)
with open(options.path, &quot;wb&quot;) as f:
pickle.dump(model, f)

Training the model on sparse data with 25k rows and 300k features looks like

time cat data/training.data | python train_data_slow.py -c 0.1 -l 0.1

The time output is:

real 0m27.605s

user 0m27.442s

sys 0m0.070s

This took about 30 second to train in the command light. Using the same code but running on lazy evaluation has a significant speed increase.

time cat data/training.data | python train_data_lazy.py -c 0.1 -l 0.1

real 0m0.959s

user 0m0.844s

sys 0m0.044s

The speed increase in this case is 30x. The results of the model are very similar, though they never will be exactly the same. This is because the updates are different, so the final model with be different.

Final Thoughts

In this post I wanted to review a paper I was exposed to in my class that allowed me to understand how taking advantage of the sparse representation of data can lead to performance increase in training a model. Sklearn in Python and glmnet in R both take advantage of this representation, so there is not any need to do it yourself. I, just out of habit of mind, like to deconstruct these ideas when I have a chance.

Thanks for Reading.

As always, I hope you gained something from reading my musings. Feel free to make comments, or let me know any corrections that need to be made. I have found tons already, but have not caught them all.

This is a math heavy post where I am using the expectation of the cost function for linear regression to discuss the bias-variance tradeoff in model predictions. First I give an example of the bias and variance tradeoff, then I jump into the math to derive the tradeoff.

Linear Regression

Linear regression is a supervised learning technique with the goal of prediction a continuous target value based on a features. One might wish to predict the price a home will sell at, and the features could be the the quality and descriptions of the home, as well as the location and various features of the neighbor. In this example, even if we had the perfect model to predict the price of the home, there are other factors that will come into place to alter the results. The motivation of the seller and buyer, the people that happen to see the home and their negotiation temperament, and other psychological effects. You could think of the price a home will sell at as a rational price plus the irrational change do to the buying/selling process.

Or more abstractly:

The truth depends on a combination of weights and features. We can write it as the following function

where our data has features, and we allow a non-zero intercept .

The large is the more complex our model is, and the stronger it will fit whatever data is used to train it. If the data is not representative of the population it is suppose to represent, the resulting model will be overfitted. It will also vary a lot from dataset to dataset. The less number of features, the less it will overfit, but the more biased the resulting model will be.

I can demonstrate this with a few lines of code in python. I am going to generate some linear data with noise, them I am going to fit the data with a Ridge Linear Regression. The alpha parameters are to limit the model complexity. The large alpha, the less complex the model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
%matplotlib inline
from sklearn.linear_model import Ridge
#Fit less complex to more complex models
for alpha in [100,10,1]:
lin = Ridge(alpha=alpha)
xvals = np.linspace(0,10,101)
ytrue = (2.5*xvals+10).reshape(101,1)
ys = []
plt.figure(figsize=(16,8))
#Generate 50 Datasets, fit them, and plot them.
for i in range(50):
x = np.random.randn(10,1)*2+5
y = x*2.5 + 10 + np.random.randn(10,1)*10
lin.fit(x,y)
yp = lin.predict(xvals.reshape(xvals.shape[0],1))
ys.append(yp)
plt.plot(xvals,yp,'b--',alpha=0.1)
ydata = ys[0].copy()
for y in ys[1:]:
ydata += y
ydata = ydata/50
plt.title('Mean Var:{0:.3f} Bias:{1:.3f}'.format(np.array([np.var(y-ydata,axis=0) for y in ys]).mean(),np.var(ytrue-ydata)))
plt.plot(xvals,ydata,'b-')
plt.plot(xvals,2.5*xvals+10,'g-')
plt.ylim(0,50)
plt.show()

The light dashed blue lines are the result of each of the 50 fits for each model. The solid blue line is the average result from all the data sets. The green line is the underlying model, or ground truth.

From top to bottom see see a less complex model has less variation from dataset to dataset, but the average result is biased. The middle model has more variation from dataset to dataset, and is less biases. The bottom model has the most variation from dataset to dataset, but is least biased.

This also illustrated the power of ensemble methods if the underlying model is not biased and the data is sampled in a way that represents the population distribution.

Linear Regression Theory

As mentioned above, Linear regression takes the form:

It is often assumed that the noise is Gaussian (though it does not have to be assumed). The probability density of getting a particular value can be written as:

Since the noise is assumed independent, the probability of getting a set of observation will just to the multiplication of the individual probabilities.

This is called the likelihood function. Linear Regression is finding weights that maximum this likelihood for a given set of observations.

For a host of reasons, it is common to take the natural log of the likelihood, generating the log-likelihood.

The first term is only dependent on the inherit noise of the system/problem and the number of observations. The right had side of the equation is where our model can make a difference. We want the model the makes this term the smallest.

Each of the dashed blue lines in the first figure is a solution to this problem. Lets now look at the expected cost.

Cost & Expected Cost

The solution of Linear Regression is the model with the least sum squared difference between the model prediction and the observed values. This is commonly called the cost or loss function. Different machine learning methodologies use different cost functions.

We showed above that the cost function for basic linear regression is:

Where truth has been dropped because we do not know the truth. We are using a model to predict what we observe.

What is observed is expected to be a continuous range that takes a gaussian distributions around the true value. We also expect that we will see a particular set of data in a data set. I want to define the probability density of getting a particular target and feature as:

We can discuss the expected cost of a model then as the integral over all possible data and all possible observed values:

Obviously we want to choose the model that will give the lowest expected cost. We can differentiate each side of the expected cost and set it to zero to find the model that does this. We will call the solution the optimal model

Doing a quit bit of algebra allows us to rewrite the above equation as below.

This highlights that:

Then the optimal solution is:

where

We can make the following simplification by using the the following probablity rule:

Optimal Model

Types of Cost

Now that we have a way to find the optimal model we can rewrite the expected cost.

It is relatively easy to show the last/bottom term is zero. That that leads to two types of expected costs. The first term is the irreducible noise.

No matter how good your model, even if it is optimal, there will be observations that different from the ground truth because of the noise. This is the buying and selling processing in selling of a house that will move the selling price from the rational price.

The second term is the model error;

The result only depends on the distribution of all possible features.

Cost From Data

We are not given a model, but instead derive a model from data. The model will always be different if it is derived from different data. We will notate this where is a given data set.

We can then write the expected model error as:

Where the model now depends on the dataset observed, and we integrate over all possible datasets. We can define the expected model based on all possible datasets as:

As before we can add and subtract the term in expected error.

Also as before it is fairly easy to show the last term is zero.

The expected model cost then results from two terms. The first/top term is the expected difference between the optimal model and the model expected to be produced by the data. This is the squared bias associated with the expected model.

The second/bottom term in the model error is the difference from each possible model and the expected model generated from the data. This is the varance of the model from the expected model.

Bias-Variance Tradeoff

The expected cost is then given by:

The models that we fit on data is somewhere on the spectrum of high biased or high variance. If we refit the model on new data, we will either be getting a lot of variation from model to model, get a biased result that has less variation. In practice this is not very useful because we have one data set.

It is also worth noting that the tradeoff is less of a trade off with more data. Here is the same plots for 10x the data as before.

Final Thoughts

I think this is interesting, and useful for a deeper theoretical understanding of model fitting, but may not be practical. This understanding is not directly applicable in a single dataset building a model. It might be useful in meta analysis doing a post-mortem on a model is in use. It is definitely useful in trying to understand Bayesian Regression Techniques. Likely to be another post in the near future.

Thank you

As always, thank you for reading. Hope it was useful and/or interesting. Feel free to share.

I have been talking about performance metrics recently, and I wanted to post/review on something from from chapter 8 of Data Science For Business.

Profit curves are a way of giving a benefit value to correct predictions and cost value to incorrect predictions. I really like these because it gives a way to evaluate the business impact of a model. It is often that we can gain epsilon increases in performance, but it may not make a meaningful difference in the bottom line. Profit curves allow us to see this.

Profit curves also allow us to estimate the optimal threshold, or what fraction of the strongest predictions, to predict positive to maximize the business impact.

Also, profit curves have constructed in a similar way to ROC curves. ROC plots show the True Positive Rate and False Positive Rate in order of prediction strength. Profit curves show expected profit in order of prediction strength.

Confusion Matrix

The confusion matrix show the count of target classifications vs prediction classifications in a matrix. The True Positive Rates and False Positive Rates are metrics that are target based metrics. This means that they are normalized by the count of either total target positives or total target negatives.

An ROC curve would plot the top row of the right-hand matrix in order of prediction strength.

We have the above confusion matrix and normalize by the target values. It allows us to estimate the rate that we get a given prediction is correct or incorrect. The next step in constructing a profit curve is to normalize the matrix with respect to the population proportions of positive and negative targets. This will allow us to get a feel for the actual rate of misclassification and correct classification in the populations we are concerned with if we know the population proportions and .

Profit Matrix

Profit curves are created by giving each term in the above altered confusion matrix with a numerical value. A cost or benefit.

For illustrations of the calculations, and to develop some intuition, I will use the following example:

This matrix says a correct positive prediction is worth $10, a correct negative prediction is worth $6, and a misclassification costs $2.

Profit Calculations

I want to do two toy calculations using the example matrix. One case is where the model predicts everyone positive. and are the count of positive and negative targets in the population.

The 1,1 illustrate that this is the upper right corner of an ROC curve.

If we have a population that has the proportion of positive and negative examples of and , we have the error rate that looks like:

The element wise multipication with our cost matrix gives:

We sum all the elements of this matrix together to get the expected profit per prediction:

This strategy is profitable. If we look at the other extreme and predict everything is negative we get the following results:

The element wise multiplication with our cost matrix produces:

I have two extreme predictions. In one case I predict everything to be positive, and make $2 per prediction. The other case is I predict everything to be negative, and make $3.33 per prediction.

Models can help improve our predictions, and we would like to do two things.

Forecast increase in profits

Choose the model parameters that optimize profits

Flower Shop

For this post I am going to make a profit curve for a flower shop that sells at two locations, but I am trying to automate the shipping. If we can get our best flowers to the right location, we will make 10 dollars, but if we have to pay 2 dollars in shipping to the correct place if the prediction is incorrect. If we get the other flowers to the other location, we get 6 dollars, but again we have to pay 2 dollars in shipping if this prediction is incorrect. The cost benefit matrix is what I have used above.

We already know the two extremes in profit, but lets fit some models on the iris data set and make some profit curves.

Model

I am using the Iris data set fit with two models: Logistics Regression and Support Vector Machines. I have hamstrung both models because I am not performing a train/test split (for the interest in focus on making profit curves) and to artificially give both models the same Area under the ROC Curve (AUC).

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,roc_curve, roc_auc_score
%matplotlib inline
iris = datasets.load_iris()
X = iris.data # we only take the first two features.
Y = (iris.target==1).astype(int)
svm = SVC(C=.1,probability=True,kernel='linear')
log = LogisticRegression(C=1)
log.fit(X,Y)
svm.fit(X,Y)
y_pred = log.predict(X)
probs = []
p = log.predict_proba(X)[:,1]
probs.append(p)
fpr, tpr, thresholds = roc_curve(Y,p)
plt.figure(figsize=(16,8))
plt.plot(fpr,tpr,label='Log ROC Curve - AUC: %.2f' % roc_auc_score(Y,p),color='steelblue',lw=2)
y_pred = svm.predict(X)
p = svm.predict_proba(X)[:,1]
probs.append(p)
fpr, tpr, thresholds = roc_curve(Y,p)
plt.plot(fpr,tpr,label='SVM ROC Curve - AUC: %.2f' % roc_auc_score(Y,p),color='firebrick',lw=2)
plt.xlabel(&amp;amp;quot;False Positive Rate&amp;amp;quot;)
plt.ylabel(&amp;amp;quot;True Positive Rate&amp;amp;quot;)
plt.legend()
plt.show()

The SVM has a higher true positive rate for lower false positive rates, while the logistic regression model has a higher true positive rate for larger false positive rates. Everything predicted positive in the upper right, and everything predicted negative is the lower left.

The AUC is the same (and even if it was not) so we need a way to evaluate the business impact.

Profit Curve Calculations

Now that I have models, and I have scored all the p values, We can look a the confusion matrix. Because I am using sklearn’s method I want to point out that the confusion matrix has the following format:

The rates then become:

The profit matrix is multiplied element wise with the above matrix. The results are:

We now sum each element in the matrix and get the expected profit.

Making Profit Curves

To make the profit curve we now just have to go through each prediction ordered by strongest predictions to our weakest predictions. For each prediction, will say everything as strong or stronger than that prediction is positive, and everything weaker than that prediction is negative. We then calculate the expected profit as shown above.

names = ['Log','SVM']
colors = ['steelblue','firebrick']
plt.figure(figsize=(16,8))
for i,p in enumerate(probs):
order = np.argsort(p)
cost_vals = []
for pp in p[order]:
cm = confusion_matrix(Y,(p&amp;amp;gt;=pp))
#Make Rates
cmr = cm/cm.sum(axis=1).astype(float).reshape(2,1)
#Multiply by target proportions
acmr = cmr*(cm.sum(axis=1).astype(float)/cm.sum()).reshape(2,1)
#Elementwise multiplication with cost matrix
cost_vals.append((acmr*cost).sum())
plt.plot(np.array(range(len(p),0,-1)).astype(float)/len(p),cost_vals,label='%s Curve - AUC: %.2f - Max Profit %.2f' % (names[i], roc_auc_score(Y,p),max(cost_vals)),color=colors[i],lw=2)
plt.xlabel(&amp;amp;quot;Percent Predicted Positive&amp;amp;quot;)
plt.ylabel(&amp;amp;quot;Expected Profit Per Prediction&amp;amp;quot;)
plt.legend()
plt.show()

The profit curve shows that logistic regression has a slightly higher profit than the SVM model, but in reality we should having train/test splits and boot strap errors. At a first order estimate this gives a feel for the business value a model can have. In this case, it allows us to increase profits significantly.

Final Thoughts

There are other business considerations that have be considered in model choice, such as technical debt, model debt, and data debt. Models do have to be maintained in practice.

While at Galvanize we did a churn analysis similar to this, with a costly intervention strategy for a cell phone provider. The best performing models are also the most costly to implement, train, and maintain. Using the wrong thresholds can also cost money.

There is significant gains to be had in this case, but the human and infrastructure costs have to be considered as well.

Thank You

As all ways, thanks for reading. Hope you enjoyed or gained something.

This is the final post in implementing logistic regression from scratch. The initial motivation for this series was I wanted to post on the theory and basic implementation of logistic regression, but also show that there are cases where one might not want to use the default models provided by a package. In this post I am going give a toy case for this scenario using a wine data set.

My thought process is there are contexts within companies where there are multiple analysis that lead to baseline expectations using separate datasets. In my company we have places where there are multiple distict sources for information that can be used to build the same metrics or models. I imagine this is the case for most data-driven companies. Context always improves the usefulness of information and decision making, and finding ways to incorporate context into models always improves generalization of the results.

In this post I am considering a scenario where there are results from another analysis that set a baseline that I want to be incorporated in the model, but the training data can not be joined with the data used for the previous analysis. In this contexts, the previous analysis is setting a prior. I want the prior to change the logistic cost function so that the results include information about the baseline.

Results of a Previous Analysis

Imagine the contrived example that someone looked at a worldwide sample of wines, and found that the amount of free sulfer dioxide (a perservative) in the wine was related to how likely a wine was to be red than white. The more free sulfur dioxide, the more likely it was to be white. It turned out that the log odds of being red was approximatey linear with respect to the concentration of the free sulfur dioxide.

Using the national sample the estimate on the coefficient was estimated to be approximately gaussian with the following results:

These results can be used to construct a prior on the coefficient for this weight (but not other weights)

x = np.linspace(-0.10,-0.20,1000)
y = norm.pdf(x,loc=-0.14,scale=0.005)
plt.figure(figsize=(16,8))
plt.plot(x,y,color="#AA3333")
plt.ylim(y.min(),y.max()+1)
plt.xlabel('Coefficient Value')
plt.ylabel('Probability Density')
plt.show()

Ridge Regularization

In the previous posts I covered the theory of Regularization and I implemented LASSO (Laplace Prior) and Ridge (Gaussian Prior) Regularization. In both cases the assumption is all coefficients have the same prior distribution centered around zero. In this case we have a prior on only 1 coefficent, and it is not cented at zero. We can generalize the prior(s) to include this.

Updating Equations

The previous prior we used assumed each weight was centered at zero, and all weights at the same distribution:

If we assume the possibility that the expected weight for each weight is different, and the expected standard deviation is different, this changes to:

That means the negative log prior is:

To get an update rule to include this prior(s) we take the derivative:

Including the logisitic weight update rule, we get the following equation:

The convention is the same as in similar posts: is the the coefficient for the ith feature, is the intercept, is the learning rate, is the expected value, and the is the standard deviation of the estimate of

Our Data

I have a sample of wines from South America, and I want to build a model that will estimate the probability of being a red wine for all wines based on the concentration of free sulfur dioxide. I have a biased sample, so in the effort to generalize I want to include the prior. There are a number of issues that I am going to ignore.

In trying to estimate the out of sample performance, the model without the prior will likely have better performance. A train/test split will both be biased south american samples, not samples that include the distribution of wines in the world.

I have no way to test on a generalize population. I would have to release it into the wild, and monitor its performance.

With enough data, the model will fit to the sample and ignore the prior.

In the previous posts in this series I wrote classes for logistic regression, but in this post I am going to do it simply with 3 functions. The first function is the probability estimate using the weights, the second function is the value of the cost (negative log likelihood ) function, and the third function is the gradient of the cost function.

The weight column vector will be a feature vector with the zeroth position is the intercept, and other weights are the slopes for the features. The X is a matrix where the first column is ‘1’ for the intercept. The y is a column vector of labels. The l2 is a row vector of regularization parameters. This allows us to set a different parameter for each feature. The ec is the expected value from the prior in a column vector.

For our case I am going to use the prior for l2 and ec:

In a pervious post on regularization, I examined the cost function in the parameter space. It was easy to see the eventual solution of the parameters that solution would come to. I showed this with sklearn in that post. In this post lets look at what would happen to the solution if we did traditional ridge regression with a regularization parameter of 4000 (the equivalent value of .

I want to load a random sample of 500 wines:

indexes = np.random.choice(wine.shape[0],500)
data = wine.loc[indexes,['free sulfur dioxide']].values
labels = (wine.loc[indexes,'Type'].values=='Red').astype(int)
X = np.hstack((np.ones((data.size,1)),data))
y = labels.reshape((labels.size,1))

Here is the code the contour plot of the cost function is below:

t1 = np.linspace(0,5,100)
t2 = np.linspace(-.3,.0,200)
ecz = np.zeros((2,1))
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16,8))
plt.subplot(121)
l2_temp = np.zeros(2)
ax = plt.gca()
for i,w0 in enumerate(t1):
for j,w1 in enumerate(t2):
w = np.array([w0,w1]).reshape(2,1)
Z[j,i] = cost(w,X,y,l2_temp,ecz)
levs = np.linspace(np.round(Z.min()+50),np.round(Z.max()-50),6)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Blues_r)
#CS = plt.contour(T1, T2, Z,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Controur Plot of Negative Log Likelihood')
plt.subplot(122)
l2_temp = np.array([4000.,4000.])
for i,w0 in enumerate(t1):
for j,w1 in enumerate(t2):
w = np.array([w0,w1]).reshape(2,1)
Z[j,i] = cost(w,X,y,l2_temp,ecz)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Reds_r)
#CS = plt.contour(T1, T2, Z,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Controur Plot of Negative Posterier')
plt.show()

Here is the resulting plot:

The left plot is a contour plot of the log likelihood, and the right plot is a contour plot of the posterior after a gaussian prior on the weights centered at zero. The traditional regulation will pull to the coefficients from about (2,-0.12) to (0,-0.05). Running this data into sklearn should confirm this:

model = LogisticRegression(C=1e25)
model.fit(data,labels)
print "Unregularized Sklearn: ", model.intercept_[0],model.coef_[0,0]
model = LogisticRegression(C=1/4000.)
model.fit(data,labels)
print "Regularize Sklearn: ", model.intercept_[0],model.coef_[0,0]

Instead of applying the traditional prior, I want to apply the experimental prior on our likelihood and find the minimum of the posterior function. First, I want to look at the cost function and posterior function.

t1 = np.linspace(0,5,100)
t2 = np.linspace(-.3,.0,200)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16,8))
plt.subplot(121)
l2_temp = np.array([0,0])
ax = plt.gca()
for i,w0 in enumerate(t1):
for j,w1 in enumerate(t2):
w = np.array([w0,w1]).reshape(2,1)
Z[j,i] = cost(w,X,y,l2_temp,ec)
levs = np.linspace(np.round(Z.min()+50),np.round(Z.max()-50),6)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Controur Plot of Negative Log Likelihood')
plt.subplot(122)
for i,w0 in enumerate(t1):
for j,w1 in enumerate(t2):
w = np.array([w0,w1]).reshape(2,1)
Z[j,i] = cost(w,X,y,l2,ec)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Reds_r)
plt.clabel(CS, inline=1, fontsize=15,cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Controur Plot of Negative Log Posterior')
plt.show()

The left plot is the same plot as above, but the right plot is now the posterior after applying a regularization on only the total sulfur dioxide coefficient. Sklearn and other packages does not have a way to solve this situation. That is why I made my own function above.

Solving the Custom Cost Function

I have shown gradient decent a numerous number of times in this series. This will be the final time!

w = np.zeros((2,1))
i = 0
while np.linalg.norm(grad(w,X,y,l2,ec)) > 1e-6:
if i % 10000 == 0:
print cost(w,X,y,l2,ec)[0],np.linalg.norm(grad(w,X,y,l2,ec))
w = w + 1e-5*grad(w,X,y,l2,ec)
i += 1
print "Custom Results: ", w[0,0],w[1,0]

These weights represent coefficients that can not be found using logistic regression with the data provided, with or without regularization. Instead, it reflect the optimal weights given the combination of the previous analysis and the data provided.

Review

In this post I showed how we can make a simple alteration to the cost function to allow priors for non-zero weights and different regularization coefficients. Using that altered function, I used the results from a prior analysis as the prior for the negative log likelihood constructed from the data. The results lead to coefficients that are not capable of generating from packages like sklearn, but could produce results that generalize better.

Thank You

I appreciate you taking the time to read this post. I hope you gained something from it. Feel free to reach out with any questions in the comments or directly through the email link in the menu.