As someone who was trained in nonlinear dynamics, I never gave much thought to linear regression. After all, what could be more boring than fitting data with a straight line? Now I use it all the time and find it rather beautiful. I’ll start with the simplest example and show how it generalizes easily. Consider a list of $N$ ordered pairs $(x_i, y_i)$ through which you want to fit a straight line. You want to find parameters $b_0$ and $b_1$ such that

$$y_i = b_0 + b_1 x_i + \epsilon_i \qquad (1)$$

has the smallest errors $\epsilon_i$, where smallest usually, although not always, means in the least squares sense.

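Hence, you want to minimize

$$\chi^2 = \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2 \qquad (2)$$

Setting the derivatives of $\chi^2$ with respect to $b_0$ and $b_1$ to zero gives

$$b_0 = \bar{y} - b_1 \bar{x}, \qquad b_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$$

where $\bar{x} = \frac{1}{N}\sum_i x_i$ (and similarly for $\bar{y}$), $\overline{xy} = \frac{1}{N}\sum_i x_i y_i$ and $\overline{x^2} = \frac{1}{N}\sum_i x_i^2$. The easy way to remember this is that the slope is equal to the covariance of x and y divided by the variance of x, i.e. $b_1 = \mathrm{Cov}(x,y)/\mathrm{Var}(x)$, and you get $b_0$ by taking the average of equation (1) and solving for $b_0$.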
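As a quick check of the recipe (slope equals covariance over variance, intercept from averaging equation (1)), here is a minimal sketch in plain Python. The function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line through the points (x_i, y_i); returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Population covariance and variance; the 1/N factors cancel in the ratio.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    b1 = cov_xy / var_x        # slope = Cov(x, y) / Var(x)
    b0 = y_bar - b1 * x_bar    # intercept from averaging equation (1)
    return b0, b1


# Made-up data lying close to y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.7]

b0, b1 = fit_line(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # b0 = 1.120, b1 = 1.940

# At the minimum the residuals sum to zero and are uncorrelated with x,
# which is exactly what setting the two derivatives to zero demands.
residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
print(sum(residuals), sum(e * x for e, x in zip(residuals, xs)))
```

The two sums printed at the end should be zero up to floating point rounding.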

Recall that the Pearson correlation coefficient is given by $r = \mathrm{Cov}(x,y)/(\sigma_x \sigma_y)$, so that $r = b_1 \sigma_x/\sigma_y$. Thus the correlation coefficient between x and y is given by the slope of the linear regression multiplied by the ratio of the standard deviations. (The square of $r$ is the fraction of the variance explained by the regression.) Now we can see where the term regression comes from. If we “standardize” equation (1) (i.e. subtract the mean of y and divide by the standard deviation) and use the fact that $b_1 = r \sigma_y/\sigma_x$, we get for the fitted line

$$z_y = r z_x, \qquad (3)$$

where $z_y = (y - \bar{y})/\sigma_y$ and $z_x = (x - \bar{x})/\sigma_x$. What (3) implies is that the deviation from the mean in units of standard deviation of $y$ is always less than or equal to the standardized deviation from the mean of $x$, since $|r| \le 1$.

Thus y has “regressed to the mean”. For example, suppose y represents daughters and x represents mothers. Then what this says is that the daughter is predicted to be closer to the mean than the mother, each measured in units of the standard deviation of her own generation. However, even though families tend to regress to the mean, this does not imply that the variance of the distribution has to shrink in each generation. The variance and mean of each generation could decrease, stay the same or increase. It’s just where you are with respect to the population that changes.
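To make regression to the mean concrete, here is a small sketch in plain Python. It checks that the correlation coefficient is the slope rescaled by the ratio of standard deviations, and that each standardized prediction is no further from the mean than its standardized input. The "mother" and "daughter" numbers are invented for illustration, and the helper names are my own.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Population standard deviation (divides by N)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical "mother" (xs) and "daughter" (ys) measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [4.0, 2.0, 8.0, 6.0, 12.0, 10.0]

sx, sy = std(xs), std(ys)
b1 = covariance(xs, ys) / sx ** 2     # regression slope
r = covariance(xs, ys) / (sx * sy)    # Pearson correlation coefficient

# The correlation is the slope multiplied by the ratio of standard deviations.
assert math.isclose(r, b1 * sx / sy)

# Standardized inputs and the standardized predictions of the fitted line.
z_x = [(x - mean(xs)) / sx for x in xs]
z_y_pred = [r * z for z in z_x]

for zx, zy in zip(z_x, z_y_pred):
    print(f"z_x = {zx:+.3f}  ->  predicted z_y = {zy:+.3f}")
```

Note that the daughters here have twice the spread of the mothers, so the variance grows between generations even though every standardized prediction shrinks toward zero.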

There is also a linear algebra way of looking at linear regression, which is useful if you want to generalize to higher dimensions. We can always generalize equation (1), leaving the error term implicit, to

$$Y = X B$$

where $Y$ is an $N \times M$ dimensional matrix of data points, $X$ is an $N \times P$ dimensional matrix of independent variables (i.e. regressors) and $B$ is a $P \times M$ dimensional matrix of parameters. For equation (1), $Y$ would be a vector with elements $y_i$, $X$ would be the $N \times 2$ matrix with all ones in the first column and elements $x_i$ in the second column, and $B$ would be the vector $(b_0, b_1)^T$. In this form we see that we want to “invert this equation” to obtain the parameters $B$. However, generally $X$ is not an invertible square matrix, so to solve this problem we multiply both sides by the transpose of $X$ and then take the inverse, giving

$$B = (X^T X)^{-1} X^T Y$$

This then is the only formula you have to remember. In fact this is just the generalized version of the one dimensional formula, where $X^T Y$ is the covariance between X and Y and $X^T X$ is the covariance/variance matrix of X (up to a common factor of $N$, and for mean-centered variables).
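Here is a hedged sketch of the matrix formula using NumPy, with synthetic data and parameter values chosen only for illustration. Rather than forming the inverse explicitly, it solves the linear system $X^T X B = X^T Y$, which is the same formula but better behaved numerically, and then compares the result with NumPy's built-in least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points, an intercept column plus two regressors.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
B_true = np.array([1.0, 2.0, -0.5])          # made-up parameters
Y = X @ B_true + 0.1 * rng.normal(size=n)    # add small Gaussian noise

# B = (X^T X)^{-1} X^T Y, computed by solving (X^T X) B = X^T Y.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("normal equations:", B_hat)
print("np.linalg.lstsq: ", B_lstsq)

# With only an intercept and one regressor, the second parameter reduces
# to the one dimensional formula Cov(x, y) / Var(x).
x = X[:, 1]
X1 = np.column_stack([np.ones(n), x])
b0, b1 = np.linalg.solve(X1.T @ X1, X1.T @ Y)
slope_1d = np.mean((x - x.mean()) * (Y - Y.mean())) / np.var(x)
print("slope from matrix formula:", b1, " from Cov/Var:", slope_1d)
```

For badly conditioned $X$, a solver such as `np.linalg.lstsq` is usually the safer choice than the normal equations.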

Typo corrected April 1, 2011

I love this – it’s the most concise course on multiple regression anybody will ever pen. One tiny glitch, though, ought to be fixed:

“and you get b_0 by taking the average of equation (2) and solving for b_0”

should instead refer to equation (1).


Thanks for the correction.


[…] the original post on MCMC. With a linear model, you can write down the answer in closed form (see here), so it is a good model to test your algorithm and code. Here it is in pseudo-Julia […]
