# Linear Regression

As someone who was trained in nonlinear dynamics, I never gave much thought to linear regression. After all, what could be more boring than fitting data with a straight line? Now I use it all the time and find it rather beautiful. I’ll start with the simplest example and show how it generalizes easily. Consider a list of $N$ ordered pairs $(x_i,y_i)$ through which you want to fit a straight line. You want to find parameters such that

$y_i = b_0 + b_1 x_i +\epsilon_i$   (1)

has the smallest errors $\epsilon_i$, where smallest usually, although not always, means in the least squares sense.

Hence, you want to minimize

$\chi^2=\sum_i (y_i-b_0-b_1 x_i)^2.$    (2)

Setting the derivatives of $\chi^2$ with respect to $b_0$ and $b_1$ to zero gives

$b_1= \sum_i (y_i -\bar{y})(x_i-\bar{x})/\sum_i(x_i-\bar{x})^2,$

where $\bar{y}=(1/N)\sum_i y_i$ and $\bar{x}=(1/N)\sum_i x_i$, and $b_0= \bar{y} - b_1 \bar{x}$.  The easy way to remember this is that the slope $b_1$ is equal to the covariance of x and y divided by the variance of x, i.e. $b_1={\rm cov}(x,y)/{\rm var}(x)$, and you get $b_0$ by taking the average of equation (1) and solving for $b_0$.
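As a quick sanity check, here is a minimal sketch in Python (with made-up data) of the covariance-over-variance recipe; the data and the "true" line y = 2 + 3x are assumptions purely for illustration:

```python
import numpy as np

# Toy data (assumed for illustration): points scattered around y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)

# Slope: covariance of x and y divided by variance of x
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
# Intercept: average equation (1) and solve for b0
b0 = y.mean() - b1 * x.mean()

# Agrees with NumPy's built-in least-squares line fit
slope, intercept = np.polyfit(x, y, 1)
assert np.allclose([b1, b0], [slope, intercept])
```

Note that `bias=True` and NumPy's default `np.var` both use the population (divide-by-N) convention, so the normalizations cancel in the ratio, as they must.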

Recall that the Pearson correlation coefficient is given by $r = {\rm cov}(x,y)/\sqrt{{\rm var}(x){\rm var}(y)}$ so that $r =b_1 \sqrt{{\rm var}(x)/{\rm var}(y)}$.  Thus the correlation coefficient between x and y is given by the slope of the linear regression multiplied by the ratio of the standard deviation of x to that of y.  (The square of $r$ is the fraction of the variance explained by the regression.)  Now we can see where the term regression comes from.  If we “standardize” equation (1) (i.e. subtract the mean and divide by the standard deviation) and use the fact that $\bar{y}= b_0 + b_1\bar{x}$ we get

$\delta y_i = r \delta x_i$,  (3)

where $\delta y_i = (y_i-\bar{y})/\sqrt{{\rm var}(y)}$ and $\delta x_i = (x_i-\bar{x})/\sqrt{{\rm var}(x)}$.  What (3) implies is that the deviation from the mean in units of standard deviations of y is always less than or equal in magnitude to the standardized deviation from the mean of x, since $-1\le r \le 1$.
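The identities above are easy to verify numerically. A short sketch, again with assumed toy data: the correlation equals the slope rescaled by the standard deviations, and regressing the standardized variables gives a slope of exactly $r$, as in equation (3):

```python
import numpy as np

# Toy paired sample (assumed); any x, y data would do here
rng = np.random.default_rng(1)
x = rng.normal(0, 2, 500)
y = 0.5 * x + rng.normal(0, 1, 500)

b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
r = np.corrcoef(x, y)[0, 1]

# r equals the slope times the ratio of standard deviations
assert np.isclose(r, b1 * np.std(x) / np.std(y))

# Standardize both variables; the regression slope is now r itself
dx = (x - x.mean()) / x.std()
dy = (y - y.mean()) / y.std()
assert np.isclose(np.cov(dx, dy, bias=True)[0, 1] / np.var(dx), r)
```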

Thus y has “regressed to the mean”.  For example, suppose y represents daughters and x represents mothers.  Then what this says is that the daughter will always be closer to the mean than the mother, in terms of the respective standard deviations of daughters and mothers.  However, even though families tend to regress to the mean, this does not imply that the variance of the distribution has to shrink in each generation.  The variance and mean of each generation could decrease, stay the same, or increase.  It’s just where you are with respect to the population that changes.

There is also a linear algebra way of looking at linear regression, which is useful if you want to generalize to higher dimensions.  We can always generalize equation (1) to

$Y = XB$

where Y is an $N\times M$ matrix of data points, X is an $N\times P$ matrix of independent variables (i.e. regressors), and B is a $P\times M$ matrix of parameters.  For equation (1), Y would be a vector with elements $y_i$, X would be the matrix with all ones in the first column and elements $x_i$ in the second column, and B would be the vector $[b_0, b_1]^T$.  In this form we see that we want to “invert this equation” to obtain the parameters B.  However, X is generally not an invertible square matrix, so to solve this problem we multiply both sides by the transpose of X and then take the inverse, giving

$B=(X^TX)^{-1}X^T Y.$

This then is the only formula you have to remember.  In fact, this is just the generalized version of the one-dimensional formula, where $X^T Y$ plays the role of the covariance between X and Y and $X^T X$ plays the role of the covariance/variance matrix of X.
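The matrix formula can be sketched in a few lines of Python; the data here are again an assumption for illustration, with N = 200 points and P = 2 regressors (an intercept column and x):

```python
import numpy as np

# Toy problem (assumed): points scattered around y = 1 - 2x
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
Y = (1.0 - 2.0 * x + rng.normal(0, 0.1, 200)).reshape(-1, 1)

# Design matrix X: a column of ones (for b0), then the x values
X = np.column_stack([np.ones_like(x), x])

# The one formula to remember: B = (X^T X)^{-1} X^T Y
B = np.linalg.inv(X.T @ X) @ X.T @ Y

# Same answer from NumPy's least-squares solver
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(B, B_lstsq)
```

In practice one would use `lstsq` (or `np.linalg.solve` on the normal equations) rather than forming the explicit inverse, which is numerically less stable; the explicit form is shown only to mirror the formula.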

Typo corrected April 1, 2011

## 3 thoughts on “Linear Regression”

1. Karl J. Kaiyala, PhD says:

I love this – it’s the most concise course on multiple regression anybody will ever pen. One tiny glitch, though, ought to be fixed:

“and you get b_0 by taking the average of equation (2) and solving for b_0”

should instead refer to equation (1).


2. […] the original post on MCMC. With a linear model, you can write down the answer in closed form (see here), so it is a good model to test your algorithm and code.  Here it is in pseudo-Julia […]
