I think it’s fair to say that many physicists have little knowledge of statistics. As I posted previously, this mostly arises because there is little need for statistics in physics (see here and here). As a result, whenever they see someone attempting to explain or extract information from data by fitting a statistical model, they’ll simply dismiss it as “merely curve fitting”. I’m going to argue here that this perceived dichotomy between a mechanistic model and a statistical model is artificial, and that in fact both camps could profit immensely by learning from each other.

To a statistician, a model generally means capturing the data in terms of a smaller number of degrees of freedom. The model chosen will then depend on the data and any prior information. They then spend most of their effort testing the validity of their model and assessing the significance of the fit. To a physicist, a model implies some mechanistic description, often in the form of a function or differential equations, based on prior knowledge that is independent of the particular data set.
For example, consider the activation of the immune system as a function of time following an infection. A physicist may try to model this by writing down a differential equation that accounts for the effect of a pathogen on the immune cells, cytokines and other physiological factors. Generally, the physicist’s model will have a number of adjustable parameters. In the physicist’s worldview, the parameters would ideally be obtained by pure reason or by some empirical method that is independent of the data set. However, in many cases this is not possible and they must resort to fitting the model to the data. At this point, they may become despondent and be embarrassed to tell their friends, lest they be accused of “merely curve fitting”. The first thing they may try is to adjust the parameters by hand and see how well the model compares to the data. There are papers published to this day that still use this method. Then they may crack open a book on statistics or use some optimization software and do some form of gradient descent. Generally, this will involve minimizing the error between their model and the data by doing a local search around an initial guess. They may not know it, but what they are doing is maximum likelihood estimation of the model parameters (least squares is maximum likelihood when the errors are Gaussian). They may then wonder whether the model is actually doing a good job of fitting the data, and slowly realize that what they really need is to compare it to a second model (i.e. a null hypothesis). Pretty soon they’re trying all varieties of goodness-of-fit tests. They may eventually discover Bayesian model comparison and Markov Chain Monte Carlo (MCMC) methods. They’ll realize that they should be comparing lots of possible models and testing which model fits the data best, balancing closeness of fit against the number of parameters.
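To make the least-squares fitting step concrete, here is a minimal sketch in Python using NumPy and SciPy. The saturating-exponential model and its parameter values are illustrative assumptions of mine, not a real immune-response model; the point is only that minimizing the squared error is maximum likelihood estimation when the noise is independent and Gaussian.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical mechanistic model: an exponential rise to saturation,
# a toy stand-in for an immune-activation curve (illustrative only).
def model(t, a, k):
    return a * (1 - np.exp(-k * t))

# Synthetic "measurements": the true curve plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
true_a, true_k = 2.0, 0.7
y = model(t, true_a, true_k) + rng.normal(0, 0.05, t.size)

# Least-squares fit from an initial guess. Under independent Gaussian
# noise, this is exactly maximum likelihood estimation of (a, k).
popt, pcov = curve_fit(model, t, y, p0=[1.0, 1.0])
a_hat, k_hat = popt
print(a_hat, k_hat)
```

The local-search-around-an-initial-guess character of the procedure is visible in the `p0` argument: a bad starting point can land the optimizer in a poor local minimum, which is one reason the later, more global Bayesian machinery becomes attractive.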
This progression is approximately the path I took from applied mathematician to accidental statistician. There are two lessons to be learned. The first is that fitting a model to data is nontrivial and interesting. Optimization theory, Bayesian inference, and machine learning are fascinating and deep fields. The second is that the only difference between a statistical model and a mechanistic model is the form of the model; the latter is usually more complicated. Basically, once you need to fit it to data, a model is just a model. In some instances, a straight line (i.e. linear regression) is the appropriate model. If you want to know how much smoking increases your chance of a heart attack, linear regression (controlling for other factors) gives you the answer. A biophysically faithful model of the entire body isn’t going to be all that useful. On the other hand, statisticians could benefit by expanding their repertoire to include more mechanistic models. For example, in the work I’ve done recently on obesity, a simple mechanistic model can go a long way in predicting how much less one should eat to lose 10 pounds. A component of my research these days involves marrying dynamical systems to statistics, in particular Bayesian methods. I think there are many opportunities to develop faster and better algorithms for fitting differential equations to data. In a future post, I will describe how my group has been using MCMC methods to estimate Bayesian posterior distributions for parameters and do Bayesian model comparison.
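To give a flavor of the MCMC approach, a bare-bones random-walk Metropolis sampler is enough. This is my own toy example, not my group's actual code: it samples the posterior of a single decay rate in a one-parameter model, assuming a flat prior on positive rates and a known Gaussian noise level.

```python
import numpy as np

# Synthetic data from a one-parameter decay model y = exp(-k t) + noise.
# The model, true rate, and noise level are illustrative assumptions.
rng = np.random.default_rng(1)
t = np.linspace(0, 5, 40)
true_k, sigma = 0.8, 0.05
y = np.exp(-true_k * t) + rng.normal(0, sigma, t.size)

def log_post(k):
    """Log posterior: flat prior on k > 0, Gaussian likelihood."""
    if k <= 0:
        return -np.inf
    resid = y - np.exp(-k * t)
    return -0.5 * np.sum(resid**2) / sigma**2

# Random-walk Metropolis: propose a nearby k, accept with
# probability min(1, posterior ratio).
samples = []
k = 1.0  # initial guess
lp = log_post(k)
for _ in range(20000):
    prop = k + rng.normal(0, 0.05)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        k, lp = prop, lp_prop
    samples.append(k)

post = np.array(samples[5000:])  # discard burn-in
print(post.mean(), post.std())
```

Unlike a single least-squares point estimate, the retained samples approximate the whole posterior distribution of the rate, so uncertainty comes for free; the same accept/reject loop generalizes to multi-parameter differential-equation models, where each likelihood evaluation involves solving the equations.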