# Summary

This is a math heavy post where I am using the expectation of the cost function for linear regression to discuss the bias-variance tradeoff in model predictions.   First I give an example of the bias and variance tradeoff, then I jump into the math to derive the tradeoff.

# Linear Regression

Linear regression is a supervised learning technique with the goal of prediction a continuous target value based on a features. One might wish to predict the price a home will sell at, and the features could be the the quality and descriptions of the home, as well as the location and various features of the neighbor. In this example, even if we had the perfect model to predict the price of the home, there are other factors that will come into place to alter the results. The motivation of the seller and buyer, the people that happen to see the home and their negotiation temperament, and other psychological effects. You could think of the price a home will sell at as a rational price plus the irrational change do to the buying/selling process. $\Large{ \textrm{Selling Price} = \textrm{True Price} + \textrm{Noise} }$

Or more abstractly: $\Large{ y_{Obs} = y_{Truth} + \epsilon }$

The truth depends on a combination of weights and features. We can write it as the following function $\Large{ y_{Truth}\left(x,w\right) = w^0 + \sum_{i=1}^{M} w^i \cdot x_i }$ $\Large{ w^i = \textrm{Weight for ith Feature} }$ $\Large{ x_i = \textrm{ith Feature} }$

where our data has $M$ features, and we allow a non-zero intercept $w^0$.

The large $M$ is the more complex our model is, and the stronger it will fit whatever data is used to train it.  If the data is not representative of the population it is suppose to represent, the resulting model will be overfitted.  It will also vary a lot from dataset to dataset.   The less number of features, the less it will overfit, but the more biased the resulting model will be.

I can demonstrate this with a few lines of code in python.  I am going to generate some linear data with noise, them I am going to fit the data with a Ridge Linear Regression.  The alpha parameters are to limit the model complexity.  The large alpha, the less complex the model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
%matplotlib inline
from sklearn.linear_model import Ridge

#Fit less complex to more complex models
for alpha in [100,10,1]:
lin = Ridge(alpha=alpha)
xvals = np.linspace(0,10,101)
ytrue = (2.5*xvals+10).reshape(101,1)
ys = []
plt.figure(figsize=(16,8))
#Generate 50 Datasets, fit them, and plot them.
for i in range(50):
x = np.random.randn(10,1)*2+5
y = x*2.5 + 10 + np.random.randn(10,1)*10
lin.fit(x,y)
yp = lin.predict(xvals.reshape(xvals.shape,1))
ys.append(yp)
plt.plot(xvals,yp,'b--',alpha=0.1)

ydata = ys.copy()
for y in ys[1:]:
ydata += y
ydata = ydata/50
plt.title('Mean Var:{0:.3f}      Bias:{1:.3f}'.format(np.array([np.var(y-ydata,axis=0) for y in ys]).mean(),np.var(ytrue-ydata)))
plt.plot(xvals,ydata,'b-')
plt.plot(xvals,2.5*xvals+10,'g-')
plt.ylim(0,50)
plt.show() The light dashed blue lines are the result of each of the 50 fits for each model.  The solid blue line is the average result from all the data sets.  The green line is the underlying model, or ground truth.

From top to bottom see see a less complex model has less variation from dataset to dataset, but the average result is biased.  The middle model has more variation from dataset to dataset, and is less biases.   The bottom model has the most variation from dataset to dataset, but is least biased.

This also illustrated the power of ensemble methods if the underlying model is not biased and the data is sampled in a way that represents the population distribution.

# Linear Regression Theory

As mentioned above, Linear regression takes the form: $\Large{ y_{Obs} = y_{Truth}(x,w)+ \epsilon }$

It is often assumed that the noise is Gaussian (though it does not have to be assumed). The probability density of getting a particular value can be written as: $\Large{ p\left(y_{Obs} | x,w,\sigma_{\textrm{Noise}}\right) = \frac{1}{\sqrt{2 \pi \sigma^2_{\textrm{Noise}}}} exp \left[ \frac{-\left( y_{Obs} - y_{truth}\left(x,w\right) \right)^2}{2 \sigma^2_{\textrm{Noise}}} \right] }$

Since the noise is assumed independent, the probability of getting a set of $N$ observation will just to the multiplication of the individual probabilities. $\Large{ p\left(\bf{y_{Obs}}\right) = \prod_{n=1}^{N} p\left(y^n_{Obs} | x^n,w,\sigma_{\textrm{Noise}}\right) }$

This is called the likelihood function. Linear Regression is finding weights that maximum this likelihood for a given set of observations.

For a host of reasons, it is common to take the natural log of the likelihood, generating the log-likelihood. $\Large{ ln \left[ p\left(\bf{y_{Obs}}\right) \right] = \sum_{n=1}^{N} ln\left[p\left(y^n_{Obs} | x^n,w,\sigma_{\textrm{Noise}}\right) \right] }$ $\Large{ ln \left[ p\left(\bf{y_{Obs}}\right) \right] = \sum_{n=1}^{N} \frac{-1}{2} ln\left[2 \pi \sigma_{\textrm{Noise}}^2\right] + \sum_{n=1}^{N} \frac{-\left( y^n_{Obs} - y_{truth}\left(x^n,w\right) \right)^2}{2 \sigma^2_{\textrm{Noise}}} }$ $\Large{ ln \left[ p\left(\bf{y_{Obs}}\right) \right] = -\frac{N}{2} ln\left[2 \pi \sigma_{\textrm{Noise}}^2\right] + \frac{-1}{2 \sigma^2_{\textrm{Noise}}} \sum_{n=1}^{N} \left( y^n_{Obs} - y_{truth}\left(x^n,w\right) \right)^2 }$

The first term is only dependent on the inherit noise of the system/problem and the number of observations. The right had side of the equation is where our model can make a difference. We want the model the makes this term the smallest. $\Large{ \textrm{Solution} = min_{w} \sum_{n=1}^{N} \left( y^n_{Obs} - y_{truth}\left(x^n,w\right) \right)^2 }$

Each of the dashed blue lines in the first figure is a solution to this problem.  Lets now look at the expected cost.

# Cost & Expected Cost

The solution of Linear Regression is the model with the least sum squared difference between the model prediction and the observed values. This is commonly called the cost or loss function. Different machine learning methodologies use different cost functions.

We showed above that the cost function for basic linear regression is: $\Large{ Cost = \sum_{n=1}^{N} \left( y^n_{Obs} - y\left(x^n,w\right) \right)^2 }$

Where truth has been dropped because we do not know the truth. We are using a model $y\left(x,w\right)$ to predict what we observe.

What is observed is expected to be a continuous range that takes a gaussian distributions around the true value. We also expect that we will see a particular set of data in a data set. I want to define the probability density of getting a particular target and feature as: $\Large{ p\left(x,y_{obs} \right) = \textrm{Probability of observing features and value} }$

We can discuss the expected cost of a model then as the integral over all possible data and all possible observed values: $\Large{ E\left[ Cost \right] = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }$

Obviously we want to choose the model $y\left(x,w\right)$ that will give the lowest expected cost. We can differentiate each side of the expected cost and set it to zero to find the model that does this. We will call the solution the optimal model $y_{Optimal}$ $\Large{ \frac{\delta E\left[ Cost \right]}{\delta y\left(x,w\right)} = 2 \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}\left(x,w\right) \right) \cdot p\left(x,y_{Obs} \right) \cdot \delta y\left(x,w\right) dy_{Obs} dx = 0 }$

Doing a quit bit of algebra allows us to rewrite the above equation as below. $\Large{ \frac{\delta E\left[ Cost \right]}{\delta y\left(x,w\right)} = 2 \int_{x} \left[ \int_{y_{obs}} y_{Obs} \cdot p\left(x,y_{Obs}\right) dy_{Obs} - y_{Optimal}\left(x,w\right) \int_{y_{obs}} p\left(x,y_{Obs}\right) dy_{Obs} \right] \delta y\left(x,w\right) dx = 0 }$

This highlights that: $\Large{ \int_{y_{obs}} y_{Obs} \cdot p\left(x,y_{Obs}\right) - y_{Optimal}\left(x,w\right) \int_{y_{obs}} p\left(x,y_{Obs}\right) dy_{Obs} = 0 }$

Then the optimal solution is: $\Large{ y_{Optimal}(x) = \frac{1}{p\left(x\right)} \cdot \int_{y_{obs}} y_{Obs} \cdot p\left(x,y_{Obs}\right) dy_{Obs} }$

where $\Large{ p\left(x\right) = \int_{y_{obs}} p\left(x,y_{Obs}\right) dy_{Obs} }$

We can make the following simplification by using the the following probablity rule: $\Large{ p(x,y_{Obs}) = p(y_{Obs}|x) \cdot p(x) }$

## Optimal Model $\Large{ y_{Optimal}(x) = \int_{y_{obs}} y_{Obs} \cdot p\left(y_{Obs} | x\right) dy_{Obs} }$

# Types of Cost

Now that we have a way to find the optimal model we can rewrite the expected cost. $\large{ E\left[ Cost \right] = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}(x) + y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }$ $\Large{ E\left[ Cost \right] = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}\right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }$ $\Large{ + \int_{x} \int_{y_{obs}} \left(y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }$ $\Large{ + \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}(x)\right) \cdot \left(y_{Optimal} - y\left(x,w\right) \right) p\left(x,y_{Obs} \right) dy_{Obs} dx }$

It is relatively easy to show the last/bottom term is zero. That that leads to two types of expected costs. The first term is the irreducible noise. $\large{ \textrm{Expected Irreducible Noise} = \int_{x} \int_{y_{obs}} \left( y_{Obs} - y_{Optimal}(x)\right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }$

No matter how good your model, even if it is optimal, there will be observations that different from the ground truth because of the noise. This is the buying and selling processing in selling of a house that will move the selling price from the rational price.

The second term is the model error; $\Large{ \textrm{Expected Model Error} = \int_{x} \int_{y_{obs}} \left(y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x,y_{Obs} \right) dy_{Obs} dx }$ $\Large{ = \int_{x} \left(y_{Optimal}(x) - y\left(x,w\right) \right)^2 p\left(x\right) dx }$

The result only depends on the distribution of all possible features.

# Cost From Data

We are not given a model, but instead derive a model from data. The model will always be different if it is derived from different data. We will notate this $y(x,w|D)$ where $D$ is a given data set.

We can then write the expected model error as: $\Large{ \textrm{Expected Model Cost} = \int_{x} \int_{D} \left(y_{Optimal}(x) - y\left(x,w|D\right) \right)^2 p\left(x,D\right) dx \ dD }$

Where the model now depends on the dataset observed, and we integrate over all possible datasets. We can define the expected model based on all possible datasets as: $\Large{ y_{Data}(x,w) = \int_{D} y\left(x,w|D\right) p\left(x,D\right) dD }$

As before we can add and subtract the term in expected error. $\Large{ \textrm{Expected Model Cost} = \int_{x} \int_{D} \left(y_{Optimal}(x) -y_{Data}(x) + y_{Data}(x) - y\left(x,w|D\right) \right)^2 p\left(x,D\right) dx \ dD }$ $\Large{ \textrm{Expected Model Cost} = \int_{x} \int_{D} \left(y_{Optimal}(x) - y_{Data}(x)\right)^2 p\left(x,D\right) dx \ dD }$ $\Large{ + \int_{x} \int_{D} \left(y_{data}(x) - y(x,w|D)\right)^2 p\left(x,D\right) dx \ dD }$ $\Large{ + \int_{x} \int_{D} \left(y_{Optimal}(x) - y_{Data}(x)\right) \cdot \left(y_{data}(x) - y(x,w|D)\right) p\left(x,D\right) dx \ dD }$

Also as before it is fairly easy to show the last term is zero. $\Large{ \textrm{Expected Model Cost} = \int_{x} \left(y_{Optimal}(x) - y_{Data}(x)\right)^2 p\left(x\right) dx \\ + \int_{x} \int_{D} \left(y_{data}(x) - y(x,w|D)\right)^2 p\left(x,D\right) dx \ dD }$

The expected model cost then results from two terms. The first/top term is the expected difference between the optimal model and the model expected to be produced by the data. This is the squared bias associated with the expected model. $\Large{ \textrm{Expected Squared Model Bias} = \int_{x} \left(y_{Optimal}(x) - y_{Data}(x)\right)^2 p\left(x\right) dx }$

The second/bottom term in the model error is the difference from each possible model and the expected model generated from the data. This is the varance of the model from the expected model. $\Large{ \textrm{Expected Model Variance} = \int_{x} \int_{D} \left(y_{data}(x) - y(x,w|D)\right)^2 p\left(x,D\right) dx \ dD }$

The expected cost is then given by: $\Large{ E\left[Cost\right] = Irreducible Noise + Expected Square Model Bias + Expected Model Variance }$

The models that we fit on data is somewhere on the spectrum of high biased or high variance.   If we refit the model on new data, we will either be getting a lot of variation from model to model, get a biased result that has less variation.   In practice this is not very useful because we have one data set.

It is also worth noting that the tradeoff is less of a trade off with more data.   Here is the same plots for 10x the data as before. # Final Thoughts

I think this is interesting, and useful for a deeper theoretical understanding of model fitting, but may not be practical.  This understanding is not directly applicable in a single dataset building a model.  It might be useful in meta analysis doing a post-mortem on a model is in use.  It is definitely useful in trying to understand Bayesian Regression Techniques.  Likely to be another post in the near future.

# Thank you

As always, thank you for reading.   Hope it was useful and/or interesting.  Feel free to share.