## Archive for the ‘Mathematics’ Category

### Bayesian model comparison Part 2

May 11, 2013

In a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the Bayes factor between two models. Here I will show one computational approach that I use, called thermodynamic integration, which is borrowed from molecular dynamics. Recall that we need to compute the model likelihood function

$P(D|M)=\int P(D|M,\theta)P(\theta|M) d\theta$     (1)

for each model where $P(D|M,\theta)$ is just the parameter dependent likelihood function we used to find the posterior probabilities for the parameters of the model.

The integration over the parameters can be accomplished using Markov chain Monte Carlo (MCMC), which I summarized previously here. We will start by defining the partition function

$Z(\beta) = \int P(D|M,\theta)^\beta P(\theta| M) d\theta$    (2)

where $\beta$ is an inverse temperature. The derivative of the log of the partition function gives

$\frac{d}{d\beta}\ln Z(\beta)=\frac{\int d\theta \ln[P(D |\theta,M)] P(D | \theta, M)^\beta P(\theta|M)}{\int d\theta \ P(D | \theta, M)^\beta P(\theta | M)}$    (3)

which is equal to the ensemble average of $\ln P(D|\theta,M)$ at inverse temperature $\beta$. If we assume that the MCMC has reached stationarity then we can replace the ensemble average with a time average $\frac{1}{T}\sum_{i=1}^T \ln P(D|\theta_i, M)$, where $\theta_i$ is the $i$th sample of the chain run at that $\beta$.  Integrating (3) over $\beta$ from 0 to 1 gives

$\ln Z(1) = \ln Z(0) + \int_0^1 \langle \ln P(D|M,\theta)\rangle_\beta d\beta$

From (1) and (2), we see that  $Z(1)=P(D|M)$, which is what we want to compute  and $Z(0)=\int P(\theta|M) d\theta=1$.

Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use $P(D|M,\theta)^\beta$ as the likelihood in the standard MCMC) and then integrate the average log likelihoods $\langle \ln P(D|M,\theta)\rangle$ over $\beta$ at the end to obtain $\ln Z(1)$. For a Gaussian likelihood function, changing temperature is equivalent to changing the data “error”. The higher the temperature the larger the presumed error. In practice, I usually run at seven to ten different values of $\beta$ and use a simple trapezoidal rule to integrate over $\beta$.  I can even do parameter inference and model comparison in the same MCMC run.
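As a rough illustration of the recipe, here is a minimal sketch on a toy problem of my own choosing, not the models I actually fit: Gaussian data with known error `SIGMA` and a Gaussian prior of width `TAU` on a single mean parameter, for which the model likelihood is known in closed form and can be used as a check. The function names, the random-walk Metropolis sampler, the step size, the chain lengths and the uniform grid of ten $\beta$ values are all assumptions made for the example.

```python
import math
import random

SIGMA = 1.0  # assumed known data "error"
TAU = 1.0    # assumed width of the Gaussian prior on theta


def log_likelihood(theta, data):
    """ln P(D | M, theta) for independent Gaussian data with mean theta."""
    n = len(data)
    sq = sum((y - theta) ** 2 for y in data)
    return -0.5 * n * math.log(2 * math.pi * SIGMA ** 2) - 0.5 * sq / SIGMA ** 2


def log_prior(theta):
    """ln P(theta | M) for a zero-mean Gaussian prior."""
    return -0.5 * math.log(2 * math.pi * TAU ** 2) - 0.5 * theta ** 2 / TAU ** 2


def mean_log_likelihood(beta, data, rng, n_steps=20000, burn_in=2000, step=1.0):
    """Random-walk Metropolis with likelihood^beta; returns the time average of ln P(D|theta)."""
    theta = 0.0
    ll, lp = log_likelihood(theta, data), log_prior(theta)
    total = 0.0
    for i in range(burn_in + n_steps):
        proposal = theta + rng.gauss(0.0, step)
        ll_new, lp_new = log_likelihood(proposal, data), log_prior(proposal)
        log_ratio = beta * (ll_new - ll) + (lp_new - lp)
        # 1 - random() lies in (0, 1], so the log is always defined
        if math.log(1.0 - rng.random()) < log_ratio:
            theta, ll, lp = proposal, ll_new, lp_new
        if i >= burn_in:
            total += ll  # untempered log likelihood at the current state
    return total / n_steps


def log_evidence(data, betas, seed=0):
    """Trapezoidal rule over beta of the averaged log likelihoods; ln Z(0) = 0."""
    rng = random.Random(seed)
    means = [mean_log_likelihood(b, data, rng) for b in betas]
    return sum(0.5 * (means[k] + means[k + 1]) * (betas[k + 1] - betas[k])
               for k in range(len(betas) - 1))


def exact_log_evidence(data):
    """Closed-form ln P(D|M) for this conjugate toy model, used only as a check."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi * SIGMA ** 2)
            - 0.5 * math.log(1 + n * TAU ** 2 / SIGMA ** 2)
            - 0.5 * (ss - TAU ** 2 * s ** 2 / (SIGMA ** 2 + n * TAU ** 2)) / SIGMA ** 2)


data = [0.8, 1.1, 0.9, 1.3, 0.7]      # made-up observations
betas = [k / 9 for k in range(10)]    # ten inverse temperatures from 0 to 1
estimate = log_evidence(data, betas)
exact = exact_log_evidence(data)
print(estimate, exact)
```

The two printed numbers should be close but not identical: the estimate carries both Monte Carlo noise and the discretization error of the trapezoidal rule, which is largest near $\beta=0$ where the average log likelihood changes fastest. Running the same code for a second model and subtracting the two log evidences would give the log Bayes factor.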

Erratum, 2013-5-2013,  I just fixed an error in the final formula

### New paper on finite size effects in spiking neural networks

January 25, 2013

Michael Buice and I have finally published our paper entitled “Dynamic finite size effects in spiking neural networks” in PLoS Computational Biology (link here). Finishing this paper seemed like a Sisyphean ordeal and it is only the first of a series of papers that we hope to eventually publish. This paper outlines a systematic perturbative formalism to compute fluctuations and correlations in a coupled network of a finite but large number of spiking neurons. The formalism borrows heavily from the kinetic theory of plasmas and statistical field theory and is similar to what we used in our previous work on the Kuramoto model (see here and  here) and the “Spike model” (see here).  Our heuristic paper on path integral methods is  here.  Some recent talks and summaries can be found here and here.

### Creating vs treating a brain

January 23, 2013

The NAND (Not AND) gate is all you need to build a universal computer. In other words, any computation that can be done by your desktop computer, can be accomplished by some combination of NAND gates. If you believe the brain is computable (i.e. can be simulated by a computer) then in principle, this is all you need to construct a brain. There are multiple ways to build a NAND gate out of neuro-wetware. A simple example takes just two neurons. A single neuron can act as an AND gate by having a spiking threshold high enough such that two simultaneous synaptic events are required for it to fire. This neuron then inhibits the second neuron that is always active except when the first neuron receives two simultaneous inputs and fires. A network of these NAND circuits can do any computation a brain can do.  In this sense, we already have all the elementary components necessary to construct a brain. What we do not know is how to put these circuits together. We do not know how to do this by hand nor with a learning rule so that a network of neurons could wire itself. However, it could be that the currently known neural plasticity mechanisms like spike-timing dependent plasticity are sufficient to create a functioning brain. Such a brain may be very different from our brains but it would be a brain nonetheless.
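To make the two-neuron construction concrete, here is a toy sketch using idealized binary threshold units. The unit weights, the thresholds and the function names are assumptions for illustration and are not meant as a biophysical model. It also builds XOR purely out of the NAND circuit as one small instance of composing gates.

```python
def and_neuron(a, b, threshold=2):
    """First neuron: fires only if two simultaneous synaptic inputs arrive."""
    return int(a + b >= threshold)


def nand_circuit(a, b):
    """Second neuron: tonically driven (drive = 1) and inhibited by the first neuron."""
    tonic_drive = 1
    inhibition = and_neuron(a, b)
    return int(tonic_drive - inhibition >= 1)


def xor_from_nand(a, b):
    """XOR wired entirely from four copies of the NAND circuit."""
    n1 = nand_circuit(a, b)
    return nand_circuit(nand_circuit(a, n1), nand_circuit(b, n1))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, nand_circuit(a, b), xor_from_nand(a, b))
```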

The fact that there are an infinite number of ways to create a NAND gate out of neuro-wetware implies that there are an infinite number of ways of creating a brain. You could take two neural networks with the same set of neurons and learning rules, expose them to the same set of stimuli and end up with completely different brains. They could have the same capabilities but be wired differently. The brain could be highly sensitive to initial conditions and noise so any minor perturbation would lead to an exponential divergence in outcomes. There might be some regularities (like scaling laws) in the connections that could be deduced but the exact connections would be different. If this were true then the connections would be everything and nothing. They would be so intricately correlated that only if taken together would they make sense. Knowing some of the connections would be useless. The real brain is probably not this extreme since we can sustain severe injuries to the brain and still function. However, the total number of hard-wired conserved connections cannot exceed the number of bits in the genome. The other connections (which are almost all of them) are either learned or are random. We do not know which is which.

To clarify my position on the Hopfield Hypothesis, I think we may already know enough to create a brain but we do not know enough to understand our brain. This distinction is crucial.  What my lab has been interested in lately is to understand and discover new treatments for cognitive disorders like Autism (e.g. see here). This implies that we need to know how perturbations at the cellular and molecular levels affect the behavioural level.  This is an obviously daunting task. Our hypothesis is that the bridge between these two extremes is the canonical cortical circuit consisting of recurrent excitation and lateral inhibition. We and others have shown that such a simple circuit can explain the neural firing dynamics in diverse tasks such as working memory and binocular rivalry (e.g. see here). The hope is that we can connect the genetic and molecular perturbations to the circuit dynamics and then connect the circuit dynamics to behavior. In this sense, we can circumvent the really hard problem of how the canonical circuits are connected to each other. This may not lead to a complete understanding of the brain or the ability to treat all disorders but it may give insights into how genes and medication act on cognitive function.

### Von Neumann’s response

December 11, 2012

Here’s Von Neumann’s response to straying from pure mathematics:

“[M]athematical ideas originate in empirics, although the genealogy is sometimes long and obscure. But, once they are so conceived, the subject begins to live a peculiar life of its own and is better compared to a creative one, governed by almost entirely aesthetic considerations, than to anything else, and, in particular, to an empirical science. There is, however, a further point which, I believe, needs stressing. As a mathematical discipline travels far from its empirical source, or still more, if it is a second and third generation only indirectly inspired by ideas coming from ‘reality’, it is beset with very grave dangers. It becomes more and more purely aestheticising, more and more purely l’art pour l’art. This need not be bad, if the field is surrounded by correlated subjects, which still have closer empirical connections, or if the discipline is under the influence of men with an exceptionally well-developed taste. But there is a grave danger that the subject will develop along the line of least resistance, that the stream, so far from its source, will separate into a multitude of insignificant branches, and that the discipline will become a disorganised mass of details and complexities. In other words, at a great distance from its empirical source, or after much ‘abstract’ inbreeding, a mathematical subject is in danger of degeneration.”

Thanks to James Lee for pointing this out.

### Complexity is the narrowing of possibilities

December 6, 2012

Complexity is often described as a situation where the whole is greater than the sum of its parts. While this description is true on the surface, it actually misses the whole point about complexity. Complexity is really about the whole being much less than the sum of its parts. Let me explain. Consider a television screen with 100 pixels that can be either black or white. The number of possible images the screen can show is $2^{100}$. That’s a really big number. Most of those images would look like random white noise. However, a small set of them would look like things you recognize, like dogs and trees and salmon tartare coronets. This narrowing of possibilities, or a reduction in entropy to be more technical, increases information content and complexity. However, too much reduction of entropy, such as restricting the screen to be entirely black or white, would also be considered to have low complexity. Hence, what we call complexity is when the possibilities are restricted but not completely restricted.

Another way to think about it is to consider a very high dimensional system, like a billion particles moving around. The system would be complex if the attractor of this six billion dimensional system (3 for position and 3 for velocity of each particle) is a lower dimensional surface or manifold.  The flow of the particles would then be constrained to this attractor. The important thing to understand about the system would then not be the individual motions of the particles but the shape and structure of the attractor. In fact, if I gave you a list of the positions and velocities of each particle as a function of time, you would be hard pressed to discover that there even was a low dimensional attractor. Suppose the particles lived in a box and they moved according to Newton’s laws and only interacted through brief elastic collisions. This is an ideal gas and what would happen is that the positions of the particles would be uniformly distributed throughout the box while each component of the velocities would obey a normal distribution, called a Maxwell-Boltzmann distribution in physics. The variance of this distribution is proportional to the temperature. The pressure, volume, particle number and temperature will be related by the ideal gas law, PV=NkT, with the Boltzmann constant set by Nature. An ideal gas at equilibrium would not be considered complex because the attractor is a simple fixed point. However, it would be really difficult to discover the ideal gas law or even the notion of temperature if one only focused on the individual particles. The ideal gas law and all of thermodynamics was discovered empirically and only later justified microscopically through statistical mechanics and kinetic theory. However, knowledge of thermodynamics is sufficient for most engineering applications like designing a refrigerator.

If you make the interactions longer range you can turn the ideal gas into a liquid and if you start to stir the liquid then you can end up with turbulence, which is a paradigm of complexity in applied mathematics. However, the main difference between an ideal gas and turbulent flow is the dimension of the attractor. In both cases, the attractor dimension is still much smaller than the full range of possibilities.

The crucial point is that focusing on the individual motions can make you miss the big picture. You will literally miss the forest for the trees. What is interesting and important about a complex system is not what the individual constituents are doing but how they are related to each other. The restriction to a lower dimensional attractor is manifested by the subtle correlations of the entire system. The dynamics on the attractor can also often be represented by an “effective theory”. Here the use of the word “effective” is not to mean that it works but rather that the underlying microscopic theory is superseded by a macroscopic one. Thermodynamics is an effective theory of the interaction of many particles. The recent trend in biology and economics has been to focus on the detailed microscopic interactions (there is push back in economics in what has been dubbed the macro-wars). As I will relate in future posts, it is sometimes much more effective (in the works better sense) to consider the effective (in the macroscopic sense) theory than a detailed microscopic theory. In other words, there is no “theory” per se of a given system but rather sets of effective theories that are to be selected based on the questions being asked.

### Von Neumann

December 4, 2012

Steve Hsu has a link to a fascinating documentary on John Von Neumann. It’s definitely worth watching.  Von Neumann is probably the last great polymath. Mathematician Paul Halmos laments that Von Neumann perhaps wasted his mathematical gifts by spreading himself too thin. He worries that Von Neumann will only be considered a minor figure in pure mathematics several hundred years hence. Edward Teller believes that Von Neumann simply enjoyed thinking above all else.

### Prediction requires data

November 8, 2012

Nate Silver has been hailed in the media as a vindicated genius for correctly predicting the election. He was savagely attacked before the election for predicting that Obama would win handily. Kudos also go to Sam Wang, Pollster.com, electoral-vote.com, and all others who simply took the data obtained from the polls seriously. Hence, the real credit should go to all of the polling organizations for collectively not being statistically biased.  It didn’t matter if single organizations were biased one way or the other as long as they were not correlated in their biases. The true power of prediction in this election was that the errors of the various pollsters were independently distributed. However, even if you didn’t take the data at face value, you could still reasonably predict the election. Obama had an inherent advantage because he had more paths to winning 270 electoral votes. Suppose there were 8 battleground states and Romney needed to win at least 6 of them. Hence, Romney had 28 ways to win while Obama had 228 ways to win. If the win probability was approximately a half in each of these states, which is what a lot of people claimed,  then Romney had a slightly more than one in ten chance of winning, which is close to the odds given by Sam. The only way Romney’s odds would increase is if the state results were correlated in his favour. However, it would take a lot of correlated bias to predict that Romney was a favourite.

Erratum, Nov 9 2012:  Romney actually has 37 ways and Obama 219 in my example.  The total must add up to 2^8=256.  I forgot to include the fact that Romney could also win 7 of 8 states or all states in his paths to winning.
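The counting in the erratum is easy to check with binomial coefficients. This is just a sketch of the 8-state toy example with every state treated as an independent coin flip; the variable names are mine.

```python
from math import comb

n_states = 8        # battleground states in the toy example
romney_needs = 6    # Romney must win at least this many

# Romney wins if he takes exactly 6, 7 or 8 of the states
romney_paths = sum(comb(n_states, k) for k in range(romney_needs, n_states + 1))
total_paths = 2 ** n_states
obama_paths = total_paths - romney_paths

# With each state a fair coin flip, every path is equally likely
romney_probability = romney_paths / total_paths

print(romney_paths, obama_paths, romney_probability)  # 37 219 0.14453125
```

With the corrected count the toy-model odds for Romney come out to roughly one in seven.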

### Predicting the election

November 3, 2012

The US presidential election on Nov. 6 is expected to be particularly close. The polling has been vigorous and there are many statistical prediction web sites. One of them, the Princeton Election Consortium, is run by neuroscientist Sam Wang at Princeton University. For any non-American readers, the president is not elected directly by the citizens but through what is called the electoral college.  This is a set of 538 electoral voters that are selected by the individual states. The electoral votes are allotted according to the number of congressional districts per state plus two. Hence, low population states are over-represented. Almost all states agree that the candidate that takes the plurality of the votes in that state wins all the electoral votes of that state. Maine and Nebraska are the two exceptions that allot electoral votes according to who wins the congressional district. Thus in order to predict who will win, one must predict who will get at least 270 electoral votes. Most of the states are not competitive so the focus of the candidates (and media) is on a handful of so-called battleground states like Ohio and Colorado. Currently, Sam Wang predicts that President Obama will win the election with a median of 319 votes. Sam estimates the Bayesian probability for Obama’s re-election to be 99.6%. Nate Silver at another popular website (Five Thirty Eight), predicts that Obama will win 305 electoral votes and has a re-election probability of 83.7%.

These estimates are made by using polling data with a statistical model. Nate Silver uses national and state polls along with some economic indicators, although the precise model is unknown. Sam Wang uses only state polls. I’ll describe his method here. The goal is to estimate the probability distribution for the number of electoral votes a specific candidate will receive. The state space consists of $2^{51}$ possibilities (50 states plus the District of Columbia). I will assume that Maine and Nebraska do not split their votes along congressional districts although it is a simple task to include that possibility. Sam assumes that the individual states are statistically independent so that the joint probability distribution factorizes completely. He then takes the median of the polls for each state over some time window to represent the probability of that given state. The polling data is comprised of the voting preferences of a sample for a given candidate. The preferences are converted into probabilities using a normal distribution. He then computes the probability for all $2^{51}$ combinations. Suppose that there are just two states with win probabilities for your candidate of $p_1$ and $p_2$. The probability of your candidate winning both states is $p_1 p_2$, state 1 but not state 2 is $p_1(1-p_2)$, and so forth.  If the states have $EV_1$ and $EV_2$ electoral votes respectively then if they win both states they will win $EV_1+EV_2$ votes and so forth. To keep the bookkeeping simple, Sam uses the trick of expressing the probability distribution as a polynomial of a dummy variable $x$.  So the probability distribution is

$(p_1 x^{EV_1} + 1-p_1)(p_2 x^{EV_2} + 1-p_2)$

$= p_1 p_2 x^{EV_1+EV_2} + p_1(1-p_2) x^{EV_1} + (1-p_1)p_2 x^{EV_2} + (1-p_1)(1-p_2)$

Hence, the coefficient of each term is the probability for the number of electoral votes given by the exponent of $x$. The expression for 51 “states” is  $\prod_{i=1}^{51} (p_i x^{EV_i} + 1-p_i)$ and this can be evaluated quickly on a desktop computer. One can then take the median or mean of the distribution for the predicted  number of electoral votes. The sum of the probabilities for electoral votes greater than 269 gives the winning probability, although Sam uses a more sophisticated method for his predicted probabilities. The model does assume that the probabilities are independent.  Sam tries to account for this by using what he calls a meta-margin, in which he calculates how much the probabilities (in terms of preference) need to move for the leading candidate to lose. Also, the state polls will likely pick up any correlations as the election gets closer.
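The polynomial trick amounts to repeatedly convolving a list of coefficients, where the index of each entry is the exponent of $x$. Below is a minimal sketch of that bookkeeping. It is not Sam's actual code, and the win probabilities and electoral vote counts in the example are invented for illustration.

```python
def ev_distribution(states):
    """states: list of (win_probability, electoral_votes) pairs.

    Returns a list dist where dist[k] is the probability of winning exactly
    k electoral votes, i.e. the coefficient of x**k in the product polynomial.
    """
    dist = [1.0]  # the polynomial "1": zero votes with certainty
    for p, ev in states:
        new = [0.0] * (len(dist) + ev)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)   # lose the state: exponent unchanged
            new[k + ev] += prob * p    # win the state: exponent shifts by ev
        dist = new
    return dist


def win_probability(dist, needed):
    """Sum of the coefficients for exponents >= needed."""
    return sum(dist[needed:])


# Hypothetical two-state example with p_1 = 0.6, EV_1 = 3 and p_2 = 0.5, EV_2 = 4
dist = ev_distribution([(0.6, 3), (0.5, 4)])
print(dist)
print(win_probability(dist, 4))
```

For the real calculation one would pass 51 pairs and call `win_probability(dist, 270)`. Each state adds one pass over a list with at most 539 entries, which is why the full product is cheap even though it represents all $2^{51}$ outcomes.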

Most statistical models predict that Obama will be re-elected with fairly high probability but the national polls are showing that the race is almost tied. This discrepancy is a puzzle.  Silver’s hypothesis for why is here and Sam’s is here.  One of the sources for error in polls is that they must predict who will vote.  The 2008 election had a voter turnout of a little less than 62%. That means that an election can be easily won or lost based on turnout alone, which makes one wonder about democracy.

### Revised SDE and path integral paper

October 10, 2012

At the MBI last week, I gave a tutorial on using path integrals to compute moments of stochastic differential equations perturbatively.  The slides are the same as the tutorial I gave a few years ago (see here).  I slightly modified the review paper that goes with the talk. I added the explicit computation for the generating functional of the complex Gaussian PDF. The new version can be found here.

### Strogatz in the Times

September 19, 2012

Don’t miss Steve Strogatz’s new series on math in the New York Times. Once again Steve manages to make math both interesting and understandable.

### A new strategy for the iterated prisoner’s dilemma game

September 4, 2012

The game theory world was stunned recently when Bill Press and Freeman Dyson found a new strategy for the iterated prisoner’s dilemma (IPD) game. They show how you can extort an opponent such that the only way they can maximize their payoff is to give you an even higher payoff. The paper, published in PNAS (link here) with a commentary (link here), is so clever and brilliant that I thought it would be worthwhile to write a pedagogical summary for those that are unfamiliar with some of the methods and concepts they use. This paper shows how knowing a little bit of linear algebra can go a really long way to exploring deep ideas.

In the classic prisoner’s dilemma, two prisoners are interrogated separately. They have two choices. If they both stay silent (cooperate) they each get a year in prison. If one confesses (defects) while the other stays silent then the defector is released while the cooperator gets 5 years.  If both defect then they both get 3 years in prison. Hence, even though the highest utility for both of them is to both cooperate, the only logical thing to do is to defect. You can watch this played out on the British television show Golden Balls (see example here). Usually the payout is expressed as a reward so if they both cooperate they both get 3 points, if one defects and the other cooperates then the defector gets 5 points and the cooperator gets zero,  and if they both defect they both get 1 point each. Thus, the combined reward is higher if they both cooperate but since they can’t trust their opponent it is only logical to defect and get at least 1 point.

The prisoner’s dilemma changes if you play the game repeatedly because you can now adjust to your opponent and it is not immediately obvious what the best strategy is. Robert Axelrod brought the IPD to public attention when he organized a tournament three decades ago. The results are published in his 1984 book The Evolution of Cooperation.  I first learned about the results in Douglas Hofstadter’s Metamagical Themas column in Scientific American in the early 1980s. Axelrod invited a number of game theorists to submit strategies to play IPD and the winning strategy, submitted by Anatol Rapoport, was called tit-for-tat, where you always cooperate first and then do whatever your opponent did on the previous round.  Since this was a cooperative strategy with retribution, people have used it ever since as an example of how cooperation could evolve. Press and Dyson now show that you can win by being nasty. Details of the calculations are below the fold.
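The claim that defection is the only logical choice in the one-shot game can be checked by writing out the reward table and asking for the best reply to each possible move of the opponent. This is a small sketch using the 3, 5, 0, 1 point values quoted above; the names `PAYOFF` and `best_reply` are mine.

```python
# (my move, opponent's move) -> my reward; "C" = cooperate, "D" = defect
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def best_reply(opponent_move):
    """The move that maximizes my reward given what the opponent does."""
    return max("CD", key=lambda my_move: PAYOFF[(my_move, opponent_move)])


for opponent_move in "CD":
    print(opponent_move, best_reply(opponent_move))
```

Defecting is the best reply to either move, which is what it means for defection to be a dominant strategy, even though mutual cooperation pays each player 3 instead of 1.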
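For readers who want to experiment, here is a bare-bones sketch of iterated play with tit-for-tat and an always-defect opponent, using the 3, 5, 0, 1 reward values from above. It does not implement the Press and Dyson strategy. The function names and the choice of ten rounds are assumptions for the example.

```python
# (my move, opponent's move) -> my reward; "C" = cooperate, "D" = defect
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Returns the total rewards (score_a, score_b) after the given number of rounds."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy only sees what the other player has done so far
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat))    # (30, 30)
print(play(tit_for_tat, always_defect))  # (9, 14)
```

Tit-for-tat never beats its opponent head to head; it does well in a tournament because pairs of cooperators accumulate far more points than pairs that fall into mutual defection.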

### What’s your likelihood?

August 10, 2012

Different internal Bayesian likelihood functions may be why we disagree. The recent shooting tragedies in Colorado and Wisconsin have set off a new round of arguments about gun control. Following the debate has made me realize that the reason the two sides can’t agree (and why different sides of almost every controversial issue can’t agree) is that their likelihood functions are completely different. The optimal way to make inferences about the world is to use Bayesian inference and there is some evidence that we are close to optimal in some circumstances. Nontechnically, Bayesian inference is a way to update  the strength of your belief in something (i.e. probability) given new data. What you do is to combine your prior probability with the likelihood of the data given your internal model of the issue (and then normalize to get a posterior probability). For a more technical treatment of Bayesian inference, see here. I posted previously (see here) that I thought that drastic differences in prior probabilities are why people don’t seem to update their beliefs when faced with overwhelming evidence to the contrary.  However, I’m starting to realize that the main reason might be that they have completely different models of the world, which in technical terms is their likelihood function.

Consider the issue of gun control.  The anti-gun control side argues that “guns don’t kill people, people kill people” and that restricting access to guns won’t prevent determined malcontents from coming up with some means to kill. The pro-gun control side argues that the amount of gun violence is inversely proportional to the ease of access to guns. After all, you would be hard pressed to kill twenty people in a movie theatre with a sword. The difference in these two viewpoints can be summarized by their models of the world.  The anti-gun control people believe that the distribution of the will of people who would commit violence looks like this
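As a toy illustration of the update rule, the sketch below has two observers start from the same prior for a hypothesis `H` and see the same data, but assign that data different likelihoods. All of the numbers are invented and the names are mine; the point is only that identical priors and identical data can still give different posteriors.

```python
def posterior(prior, likelihood):
    """prior: dict hypothesis -> P(hypothesis); likelihood: dict hypothesis -> P(data | hypothesis)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


prior = {"H": 0.5, "not H": 0.5}  # both observers start undecided

# Observer A's model says the data are equally likely either way
likelihood_a = {"H": 0.5, "not H": 0.5}
# Observer B's model says the data are four times likelier if H is true
likelihood_b = {"H": 0.8, "not H": 0.2}

posterior_a = posterior(prior, likelihood_a)
posterior_b = posterior(prior, likelihood_b)
print(posterior_a["H"], posterior_b["H"])
```

Observer A's belief in `H` does not move at all while observer B's rises to 0.8, so new data of the same kind will never bring them together unless one of them changes their likelihood function.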

where the horizontal line represents a level of gun restriction.  In this world view, no amount of gun restriction would prevent these people from undertaking their nefarious designs.  On the other hand, the pro-gun control side believes that the same distribution looks like this

in which case, the higher you set the barrier the fewer the number of crimes committed. Given these two views of the world, it is clear why new episodes of gun violence like the recent ones in Colorado and Wisconsin do not change people’s minds. What you would need to do is to teach a new likelihood function to the other side and that may take decades if at all.

### Is abstract thinking necessary?

August 1, 2012

Noted social scientist, Andrew Hacker, wrote a provocative opinion piece in the New York Times Sunday arguing that we relax mathematics requirements for higher education. Here are some excerpts from his piece:

New York Times: A TYPICAL American school day finds some six million high school students and two million college freshmen struggling with algebra. In both high school and college, all too many students are expected to fail. Why do we subject American students to this ordeal? I’ve found myself moving toward the strong view that we shouldn’t.

…There are many defenses of algebra and the virtue of learning it. Most of them sound reasonable on first hearing; many of them I once accepted. But the more I examine them, the clearer it seems that they are largely or wholly wrong — unsupported by research or evidence, or based on wishful logic. (I’m not talking about quantitative skills, critical for informed citizenship and personal finance, but a very different ballgame.)

…The toll mathematics takes begins early. To our nation’s shame, one in four ninth graders fail to finish high school. In South Carolina, 34 percent fell away in 2008-9, according to national data released last year; for Nevada, it was 45 percent. Most of the educators I’ve talked with cite algebra as the major academic reason.

The expected reaction from some of my colleagues was understandably negative. After all, we live in a world that is becoming more complex requiring more mathematical skills not less. Mathematics is as essential to one’s education as reading. In the past, I too would have wholeheartedly agreed. However, over the past few years I have started to think otherwise. Just to clarify, neither Hacker nor I believe that critical thinking is unimportant. He argues forcefully that all citizens should have a fundamental grounding in the concepts of arithmetic, statistics and quantitative reasoning. I have even posted before (see here)  that I thought mathematics should be part of the accepted canon of what an educated citizen should know and I’m not backing away from that belief. Hacker thinks we should be taught a “citizen’s statistics” course. My suggested course was:  “Science and mathematics survival tools for the modern world.”  The question is whether or not we should expect all students to master the abstract reasoning skills necessary for algebra.

I’ll probably catch a lot of flak for saying this but from my professional and personal experience, I believe that there is a significant fraction of the population that is either unable or unwilling to think abstractly.  I also don’t think we can separate lack of desire from lack of ability. The willingness to learn something may be just as “innate” as the ability to do something. I think everyone can agree that on the abstract thinking scale almost everyone can learn to add and subtract but only a select few can understand cohomology theory.  In our current system, we put high school algebra as the minimum threshold, but is this a reasonable place to draw the line? What we need to know  is the distribution of people’s maximum capacity for abstract thinking. The current model requires that  the distribution be  almost zero left of algebra with a fat tail on the right. But what if the actual distribution is broad with a peak somewhere near calculus?  In this case, there would be a large fraction of the population to the left of algebra. This is pure speculation but there could even be a neurophysiological basis to abstract thinking in terms of the fraction of neural connections within higher cortical areas versus connections between cortical and sensory areas. There could be a trade-off between abstract thinking and sensory processing. This need not even be purely genetic. As I posted before, not all the neural connections can be set by the genome so most are either random or arise through plasticity.

To me, the most important issue that Hacker brings up is not whether or not we should make everyone learn algebra but what should we do about the people who don’t and as a result are denied the opportunity to attend college and secure a financially stable life. Should we devote our resources to try to teach it to them better or should we develop alternative ways for these people to be productive in our society? I really think we should re-evaluate the goal that everyone goes to college. In fact, given the exorbitant cost and the rise of online education, the trend away from traditional college may have already begun. We should put more emphasis on apprenticeship programs and community colleges. Given the rapid rate of change in the job market, education and training should be thought of as a continual process instead of the current model of four years and out. I do believe that a functional democracy requires an educated citizenry. However, college attendance has been steadily increasing the past few decades but one would be hard pressed to argue that democracy has concomitantly improved. A new model may be in order.

### Log normal

May 25, 2012

A comment to my previous post correctly points out that the income distribution is approximately log-normal. What this means is that while income itself is not normally distributed, the logarithm of income is.  The log-normal distribution has a pretty fat tail for high incomes. A variable will be log-normal if it is the product of a lot of random variables, since the log of a product is a sum. It has been argued for many years that achievement should be log-normal because it involves the product of many independent events. This is why a good programmer can be hundreds of times better than a mediocre one.  I even gave a version of this argument here. Hence, small differences in innate ability can lead to potentially large differences in outcome. However, despite the fact that income may deviate from log-normality in some cases and in particular between sectors of the economy (e.g. finance vs. philosophy), there is still a question of whether the compensation scheme needs to follow log-normal even if productivity does. After all, if small differences in innate ability are magnified to such a large extent, one could argue that income should be pegged to the log of productivity.
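The product argument is easy to see in a quick simulation. In this sketch each “outcome” is the product of 50 independent positive factors drawn uniformly between 0.5 and 1.5. The number of factors, their range and the sample size are arbitrary choices for illustration and are not a model of income or achievement.

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
n_factors = 50           # assumed number of multiplicative factors
n_samples = 5000

products = []
for _ in range(n_samples):
    value = 1.0
    for _ in range(n_factors):
        value *= rng.uniform(0.5, 1.5)
    products.append(value)

logs = [math.log(v) for v in products]

# The raw products are strongly skewed: a fat right tail pulls the mean above the median
print(statistics.mean(products), statistics.median(products))
# The logs are a sum of many terms, so they are close to symmetric (roughly normal)
print(statistics.mean(logs), statistics.median(logs))
```

The mean of the products sits well above their median, which is the signature of the fat tail, while the mean and median of the logs nearly coincide.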

### Nonlinearity in your wallet

May 25, 2012

Many human traits like height, IQ, and 50 metre dash times are very close to being normally distributed. The normal distribution (more technically the normal probability density function) or Gaussian function

$f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/2\sigma^2}$

is the famous bell-shaped curve that the histogram of class grades falls on. The shape of the Gaussian is specified by two parameters: the mean $\mu$, which coincides with the peak of the bell, and the standard deviation $\sigma$, which is a measure of how wide the Gaussian is. Let’s take height as an example. There is a 68% chance that any person will be within one standard deviation of the mean and a little more than a 95% chance that they will be within two standard deviations. The tallest one percent are at least about 2.3 standard deviations above the mean.
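These numbers can be reproduced directly from the Gaussian. A small sketch using `NormalDist` from Python's standard `statistics` module, working in units where $\mu = 0$ and $\sigma = 1$ (the helper name `prob_within` is mine):

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k):
    """Probability of falling within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_one = prob_within(1.0)
within_two = prob_within(2.0)
# Number of standard deviations above the mean where the top 1% begins.
top_one_percent_cutoff = standard_normal.inv_cdf(0.99)

print(f"within 1 sigma: {within_one:.4f}")
print(f"within 2 sigma: {within_two:.4f}")
print(f"top 1% starts at {top_one_percent_cutoff:.2f} sigma above the mean")
```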

The fact that lots of things are normally distributed is not an accident but a consequence of the central limit theorem (CLT), which may be the most important mathematical law in your life. The theorem says that the probability distribution of a sum of a large number of independent random things, none of which dominates the rest, will be approximately normal (i.e. a Gaussian). In the example of height, it suggests that there are perhaps hundreds or thousands of genetic and environmental factors that determine your height, each contributing a little amount. When you add them together you get your height and the distribution is normal.
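You can watch the CLT at work with a toy simulation. This sketch assumes a made-up "trait" that is the sum of 200 independent factors, each uniform between 0 and 1 (nothing like real genetics, just the simplest thing that is clearly not Gaussian on its own), and checks that the sum lands within one and two standard deviations of the mean about as often as a Gaussian would.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable


def trait(n_factors):
    """A toy trait: the sum of n_factors independent uniform contributions."""
    return sum(rng.random() for _ in range(n_factors))


values = [trait(200) for _ in range(10000)]
mean = statistics.fmean(values)
sd = statistics.pstdev(values)

# Fractions of the simulated population within one and two standard deviations.
frac_one = sum(abs(v - mean) <= sd for v in values) / len(values)
frac_two = sum(abs(v - mean) <= 2 * sd for v in values) / len(values)

print(f"within 1 sd: {frac_one:.3f} (a Gaussian gives about 0.683)")
print(f"within 2 sd: {frac_two:.3f} (a Gaussian gives about 0.954)")
```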

Now, the one major thing in your life that bucks the normal trend is income and especially wealth distribution. Incomes are extremely non-normal. They have what are called fat tails, meaning that the incomes of the top earners are much higher than would be expected from a normal distribution. A general rule of thumb called the Pareto Principle is that 20% of the population controls 80% of the wealth. It may even be more skewed these days.

There are many theories as to why income and wealth are distributed the way they are and I won’t go into any of these. What I want to point out is that whatever it is that governs income and wealth, it is definitely nonlinear. The key ingredient in the CLT is that the factors add linearly. If there were some nonlinear combination of the variables then the result need not be normal. It has been argued that some amount of inequality is unavoidable given that we are born with unequal innate traits but the translation of those differences into income inequality is a social choice to some degree. If we rewarded the contributors to income more linearly, then incomes would be distributed more normally (there would be some inherent skew because incomes must be positive). In some sense, the fact that some sectors of the economy seem to have much higher incomes than other sectors implies a market failure.

### Known unknown unknowns

April 20, 2012

I listened today to the podcast of science fiction writer Bruce Sterling’s Long Now Foundation talk from 2004 on “The Singularity: Your Future as a Black Hole”. The talk is available here. Sterling describes mathematician and science fiction writer Vernor Vinge’s conception of the singularity as a scary moment in time where super human intelligence ends the human era and we have no way to predict what will happen. I won’t address the issue of whether or not such a moment in time will happen in the near future or ever. I’ve posted about it in the past (e.g. see here, here and here). What I do want to discuss is whether or not there can exist events or phenomena that are so incomprehensible that they will reduce us to a quivering pile of mush. I think an excellent starting point is former US Secretary of Defense Donald Rumsfeld’s infamous speech from 2002 regarding the link between Iraq and weapons of mass destruction prior to the Iraq war, where he said:

[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don’t know.

While Rumsfeld was mocked by the popular media for this seemingly inane statement, I actually think (geopolitical consequences aside) that it was the deepest thing I ever heard him say. Rumsfeld is a Bayesian! There is a very important distinction between known unknowns and unknown unknowns. In the first case, we can assign a probability to the event. In the second we cannot. Stock prices are known unknowns, while black swans are unknown unknowns. (Rumsfeld’s statement predates Nassim Taleb’s book.) My question is not whether we can predict black swans (by definition we cannot) but whether something can ever occur that we wouldn’t even be able to describe, much less understand.

In Bayesian language, a known unknown would be any event for which a prior probability exists. Unknown unknowns are events for which there is no prior. It’s not just that the prior is zero, but also that the event is not included in our collection (i.e. sigma algebra) of possibilities. Now, the space of all possible things is uncountably infinite so it seems likely that there will be things that you cannot imagine. However, I claim that simply acknowledging that there exist things that I cannot possibly ever imagine is sufficient to remove the surprise. We’ve witnessed enough in the post-modern world to assign a nonzero prior to the possibility that anything can and will happen. That is not to say that we won’t be very upset or disturbed by some event. We may read about some horrific act of cruelty tomorrow that will greatly perturb us but it won’t be inconceivably shocking. The entire world could vanish tomorrow and be replaced by an oversized immersion blender and while I wouldn’t be very happy about it and would be extremely puzzled by how it happened, I would not say that it was impossible. Perhaps I won’t be able to predict what will happen after the singularity arrives but I won’t be surprised by it.

### The Mandelbrot set

March 16, 2012

The Mandelbrot set is often held up as an example of how amazing complexity can be generated from a simple dynamical system. In comments to my previous posts on the information content or Kolmogorov complexity of the brain, it was brought up as an example of how the brain could be very complex yet still be fully specified by the genome. While I agree with this premise, the Mandelbrot set is not the best example to show this. Now, the Mandelbrot set is a beautiful example of how you can generate incredibly complex fractal landscapes using a simple algorithm. However, it takes an uncountably infinite amount of information to specify it.

Let’s be more precise. Consider the iterative map $z \rightarrow z^2 +C$. Pick any complex number for $C$ and iterate the map starting at $z=0$. The ensuing iterates or orbit will either go to infinity or stay bounded. The Mandelbrot set is the set of all values of $C$ for which the orbit stays bounded. In essence, it consists of all complex numbers such that the sequence $C$, $C^2 +C$, $(C^2+C)^2 +C$, $\dots$ stays bounded. You can immediately rule out some numbers. You know that zero will always stay bounded and you also know that any number with magnitude greater than 2 will go to infinity. In fact, to compute the Mandelbrot set, you just have to see if the magnitude of any iterate exceeds 2 because after that you know it is gone. The question then is what happens in between and it turns out that the boundary of the Mandelbrot set is this beautiful fractal shape that looks like sea horses within sea horses and so forth.
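The escape test translates into only a few lines. This is a minimal sketch of the usual escape-time approximation, where the function name, the iteration cap of 200 and the grid spacing are my own choices. Note that returning `True` only means the orbit has not escaped yet within the cap, so the picture it draws is an approximation to the set and not the set itself.

```python
def stays_bounded(c, max_iterations=200):
    """Return False as soon as an iterate of z -> z^2 + c has magnitude above 2.

    True only means no escape was seen within max_iterations (an arbitrary cap),
    so points very close to the boundary can be misclassified.
    """
    z = 0j
    for _ in range(max_iterations):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True


# Crude text rendering on a coarse grid: '#' marks points that did not escape.
rows = []
for i in range(21):
    y = 1.2 - 0.12 * i
    row = "".join(
        "#" if stays_bounded(complex(-2.1 + 0.05 * j, y)) else "."
        for j in range(60)
    )
    rows.append(row)

print("\n".join(rows))
```

For example, `stays_bounded(0j)` and `stays_bounded(-1 + 0j)` come back `True` (the second orbit just cycles between $-1$ and $0$), while `stays_bounded(1 + 0j)` comes back `False` because the orbit runs $1, 2, 5, \dots$.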

The question then is how much information do you need to construct the Mandelbrot set.  The answer as proved by Blum, Cucker, Shub, and Smale (see their book Complexity and Real Computation), is that the Mandelbrot set is undecidable.  There is no algorithm to obtain the boundaries of the Mandelbrot set.  In other words, you would need an uncountable amount of information to specify it.  The beautiful pictures we see, as shown above, are only approximations to the set.

However, I am not hostile to the idea that simple things can generate complexity. One could say that my career is based on this idea. It is what chaos theory is all about. I use the argument all the time. I’m just saying that the Mandelbrot set is not a great example. Perhaps a better example is the logistic map on real numbers $x \rightarrow rx(1-x)$, with $x$ between zero and one. If r is between zero and one, then all orbits will eventually go to zero but as you increase r, the nature of the orbits will change and eventually you’ll reach a period-doubling cascade to chaos. If you choose an r that is slightly bigger than 3.57 then you’ll get chaos. This implies that small changes to the initial conditions will give you completely different results and also that if you just plot the iterates coming out of the map, they will seem to have no apparent pattern. If you were to naively estimate the complexity or information content of the orbit, you could be led to believe that it has high information content even though the Kolmogorov complexity is actually quite small and is given by the logistic map and the initial condition. However, this may also not be the greatest example because there are ways to deduce that the orbit came from a low dimensional chaotic system rather than a high dimensional system.
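Both regimes are easy to see numerically. A short sketch using the logistic map in its standard form $x \rightarrow rx(1-x)$, where the specific values $r=0.8$, $r=3.9$, the starting point 0.3 and the perturbation of $10^{-10}$ are arbitrary choices of mine for illustration:

```python
def logistic_orbit(r, x0, n_steps):
    """Iterate x -> r * x * (1 - x) starting from x0 and return the orbit."""
    x = x0
    orbit = []
    for _ in range(n_steps):
        x = r * x * (1.0 - x)
        orbit.append(x)
    return orbit


# For r between zero and one the orbit decays to zero.
decaying = logistic_orbit(0.8, 0.3, 200)

# In the chaotic regime, two orbits that start 1e-10 apart end up far apart.
orbit_a = logistic_orbit(3.9, 0.3, 100)
orbit_b = logistic_orbit(3.9, 0.3 + 1e-10, 100)
max_gap = max(abs(a - b) for a, b in zip(orbit_a, orbit_b))

print("final value for r = 0.8:", decaying[-1])
print("initial gap for r = 3.9:", abs(orbit_a[0] - orbit_b[0]))
print("largest gap over 100 steps:", max_gap)
```

The whole "program" that generates the chaotic orbit is the three-line loop plus two numbers, which is the sense in which its Kolmogorov complexity is small.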

### Talk at MBI

February 23, 2012

I’m currently about to give a talk at a workshop  on statistical inference in biology at the Mathematical Biosciences Institute at Ohio State.  My talk is a variation of previous ones on using Bayesian methods for parameter estimation and model comparison.  The slides are here.

### New paper in Biophysical Journal

February 14, 2012

Bayesian Functional Integral Method for Inferring Continuous Data from Discrete Measurements

Biophysical Journal, Volume 102, Issue 3, 399-406, 8 February 2012

doi:10.1016/j.bpj.2011.12.046

William J. Heuett, Bernard V. Miller, Susan B. Racette, John O. Holloszy, Carson C. Chow, and Vipul Periwal

Abstract: Inference of the insulin secretion rate (ISR) from C-peptide measurements as a quantification of pancreatic β-cell function is clinically important in diseases related to reduced insulin sensitivity and insulin action. ISR derived from C-peptide concentration is an example of nonparametric Bayesian model selection where a proposed ISR time-course is considered to be a “model”. An inferred value of inaccessible continuous variables from discrete observable data is often problematic in biology and medicine, because it is a priori unclear how robust the inference is to the deletion of data points, and a closely related question, how much smoothness or continuity the data actually support. Predictions weighted by the posterior distribution can be cast as functional integrals as used in statistical field theory. Functional integrals are generally difficult to evaluate, especially for nonanalytic constraints such as positivity of the estimated parameters. We propose a computationally tractable method that uses the exact solution of an associated likelihood function as a prior probability distribution for a Markov-chain Monte Carlo evaluation of the posterior for the full model. As a concrete application of our method, we calculate the ISR from actual clinical C-peptide measurements in human subjects with varying degrees of insulin sensitivity. Our method demonstrates the feasibility of functional integral Bayesian model selection as a practical method for such data-driven inference, allowing the data to determine the smoothing timescale and the width of the prior probability distribution on the space of models. In particular, our model comparison method determines the discrete time-step for interpolation of the unobservable continuous variable that is supported by the data. Attempts to go to finer discrete time-steps lead to less likely models.

### Proof by simulation

February 7, 2012

The process of science and mathematics involves developing ideas and then proving them true.   However, what is meant by a proof depends on what one is doing.  In science, a proof is empirical.  One starts with a hypothesis and then tests it experimentally or observationally.  In pure math, a proof means that a given statement is consistent with a set of rules and axioms.  There is a huge difference between these two approaches.  Mathematics is completely internal.  It simply strives for self-consistency.  Science is external.  It tries to impose some structure on an outside world.  This is why mathematicians sometimes can’t relate to scientists and especially physicists and vice versa.

Theoretical physicists don’t need to always follow rules.  What they can do is to make things up as they go along.  To make a music analogy – physics is like jazz.  There is a set of guidelines but one is free to improvise.  If in the middle of a calculation one is stuck because they can’t solve a complicated equation, then they can assume something is small or big or slow or fast and replace the equation with a simpler one that can be solved.  One doesn’t need to know if any particular step is justified because all that matters is that in the end, the prediction must match the data.

Math is more like composing western classical music. There is a strict set of rules that must be followed. All the notes must fall within the diatonic scale framework. The rhythm and meter are tightly regulated. There are a finite number of possible choices at each point in a musical piece just as in a mathematical proof. However, there are a countably infinite number of possible musical pieces just as there are an infinite number of possible proofs. That doesn’t mean that rules can’t be broken, just that when they are broken a paradigm shift is required to maintain self-consistency in a new system. Whole new fields of mathematics and genres of music arise when the rules are violated.

The invention of the computer introduced a third means of proof. Prior to the computer, when making an approximation, one could either take the mathematician’s approach and try to justify the approximation by putting bounds on the error terms analytically or take the physicist’s approach and compare the end result with actual data. Now one can numerically solve the more complicated expression and compare it directly to the approximation. I would say that I have spent the bulk of my career doing just that. Although I don’t think there is anything intrinsically wrong with proving by simulation, I do find it to be unsatisfying at times. Sometimes it is nice to know that something is true by proving it in the mathematical sense and other times it is gratifying to compare predictions directly with experiments. The most important thing is to always be aware of what mode of proof one is employing. It is not always clear-cut.
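To make this concrete, here is a toy example of my own choosing (not one from my research): the pendulum equation $\ddot\theta = -\sin\theta$ and its small-angle approximation $\ddot\theta = -\theta$, whose solution starting from rest at $\theta_0$ is $\theta_0\cos t$. The sketch integrates the full equation with a velocity Verlet scheme (the step size and time window are arbitrary) and reports the worst disagreement with the approximation.

```python
import math


def simulate_pendulum(theta0, t_end, dt=1e-3):
    """Integrate theta'' = -sin(theta) from rest using velocity Verlet."""
    theta, omega = theta0, 0.0
    accel = -math.sin(theta)
    trajectory = [(0.0, theta)]
    n_steps = int(round(t_end / dt))
    for step in range(1, n_steps + 1):
        theta += omega * dt + 0.5 * accel * dt * dt
        new_accel = -math.sin(theta)
        omega += 0.5 * (accel + new_accel) * dt
        accel = new_accel
        trajectory.append((step * dt, theta))
    return trajectory


def max_error_of_small_angle(theta0, t_end=10.0):
    """Largest gap between the simulated angle and theta0 * cos(t)."""
    return max(
        abs(theta - theta0 * math.cos(t))
        for t, theta in simulate_pendulum(theta0, t_end)
    )


for theta0 in (0.1, 1.0):
    print(f"theta0 = {theta0}: max error = {max_error_of_small_angle(theta0):.4f}")
```

The approximation holds up well for a release angle of 0.1 radians and visibly fails for 1 radian. Of course, this only tells you about the cases you actually ran, which is part of why this mode of proof can feel unsatisfying.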