Imagine you've got two normally distributed random variables, *x* and *y*. Here is their relationship:

*x* provides little information about *y*. That is, there is no clear predictive relationship. Certainly not a linear one.

We can test this by trying to fit a linear model (*y = ax + b*) to the data (the red line). This procedure shows us that the variance in *x* accounts for less than 1% of the variance in *y*.
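If you want to try this yourself, here is a minimal sketch using NumPy. The sample size (1000), the seed, and the helper name `r_squared` are my own assumptions for illustration; the original data behind the plot isn't available, so your exact number will differ.

```python
import numpy as np

def r_squared(x, y, degree=1):
    """Fraction of the variance in y explained by a least-squares
    polynomial fit of the given degree (hypothetical helper)."""
    coeffs = np.polyfit(x, y, degree)          # least-squares fit
    residuals = y - np.polyval(coeffs, x)      # what the model misses
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)   # arbitrary seed, assumed
x = rng.normal(size=1000)        # assumed sample size
y = rng.normal(size=1000)        # drawn independently of x

r2_full = r_squared(x, y, degree=1)
print(f"linear fit on all points: R^2 = {r2_full:.4f}")
```

Because *x* and *y* are drawn independently here, the fraction of variance explained should come out close to zero.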

But let's imagine you can't collect all of the data. Instead your sample includes only *two* of those data points. Again you try to fit a linear model and...

The linear model (*y = ax + b*) explains **100% of the variance**! Now you might say, "my linear model is *perfect* and completely explains the relationship between *x* and *y*". But wait... that fitted line doesn't look the same as the linear fit when we've got the full data:

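**That** is overfitting. A line has two free parameters, so it passes exactly through any two points with different *x* values, whatever the true relationship is. When your model has too many parameters relative to the number of data points, you're prone to overestimating the utility of your model.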
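The two-point case can be sketched in a few lines. As before, the seed, the sample size, and the variable names are assumptions of mine, not the data from the figures. The sketch fits a line to two randomly chosen points and then scores that same line against everything.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed, assumed
x = rng.normal(size=1000)        # the "full" data, assumed size
y = rng.normal(size=1000)        # independent of x

# Pretend we only managed to collect two of the points.
idx = rng.choice(x.size, size=2, replace=False)
xs, ys = x[idx], y[idx]

# Fit y = a*x + b to the two sampled points.
slope, intercept = np.polyfit(xs, ys, 1)

# Variance explained on the two points the line was fitted to.
fitted = slope * xs + intercept
r2_sample = 1 - np.sum((ys - fitted) ** 2) / np.sum((ys - ys.mean()) ** 2)

# Variance explained by that same line on the full data.
pred_full = slope * x + intercept
r2_on_full = 1 - np.sum((y - pred_full) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R^2 on the two sampled points: {r2_sample:.4f}")
print(f"R^2 of the same line on all points: {r2_on_full:.4f}")
```

The second number can even be negative, which just means the two-point line does worse on the full data than predicting the mean of *y* everywhere.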

We can keep going with this:

Add a third data point and the linear model (*y = ax + b*) once again performs poorly (~7% of the variance), but our new model, a quadratic model (*y = ax^2 + bx + c*), explains **100% of the variance**.

Again, sampling one more data point:

The linear *and* quadratic models both perform poorly, but our *new* new model, a cubic model (*y = ax^3 + bx^2 + cx + d*), explains **100% of the variance**.

And so on.
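The pattern is that a polynomial of degree *n - 1* has *n* free parameters, so it can pass exactly through *n* points. Here is a rough sketch of that progression, again with an assumed seed and a hypothetical `r_squared` helper rather than the original data:

```python
import numpy as np

def r_squared(x, y, degree):
    """Variance in y explained by a least-squares polynomial fit
    of the given degree (hypothetical helper)."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - np.mean(y)) ** 2)

rng = np.random.default_rng(2)   # arbitrary seed, assumed
results = {}

for n_points in (2, 3, 4):
    # A fresh small sample of unrelated x and y values.
    xs = rng.normal(size=n_points)
    ys = rng.normal(size=n_points)
    # Try every polynomial degree up to n_points - 1.
    for degree in range(1, n_points):
        results[(n_points, degree)] = r_squared(xs, ys, degree)
        print(f"{n_points} points, degree {degree}: "
              f"R^2 = {results[(n_points, degree)]:.4f}")
```

For each sample size, the highest degree tried (one less than the number of points) should report an R^2 of 1 up to floating-point error, even though *x* and *y* are unrelated by construction.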
