## Archive for the ‘Probablity’ Category

### Big Data backlash

April 7, 2014

I predicted that there would be an eventual push back on Big Data and it seems that it has begun. Gary Marcus and Ernest Davis of NYU had an op-ed in the Times yesterday outlining nine issues with Big Data. I think one way to encapsulate many of the critiques is that you will never be able to do true prior free data modeling. The number of combinations in a data set grows as the factorial of the number of elements, which grows faster than an exponential. Hence, Moore’s law can never catch up. At some point, someone will need to exercise some judgement in which case Big Data is not really different from the ordinary data that we deal with all the time.

### The myth of the single explanation

July 30, 2013

I think one of the things that tends to lead us astray when we try to understand complex phenomena like evolution, disease, or the economy, is that we have this idea that they must have a single explanation. For example, recently two papers have been published in high profile journals trying to explain mammal monogamy. Although monogamy is quite common in birds it only occurs in 5% of mammals. Here is Carl Zimmer’s summary.  The study in Science, which surveyed 2545 mammal species, argued that monogamy arises when females are solitary and sparse. Males must then commit to one since dates are so hard to find. The study in PNAS examined 230 primate species, for which monogamy occurs at the higher rate of 27%, and used Bayesian inference to argue that monogamy arises to prevent male infanticide. It’s better to help out at home rather than go around killing other men’s babies. Although both of these arguments are plausible, there need not be a single universal explanation. Each species could have its own set of circumstances that led to monogamy involving these two explanations and others. However, while we should not be biased towards a single explanation, we shouldn’t also throw up our hands like Hayek and argue that no complex phenomenon can be understood. Some phenomena will have simpler explanations than others but since the Kolmogorov complexity is undecidable there is no algorithm that can tell you which is which. We will just have to struggle with each problem as it comes.

### Talk at GRC

July 24, 2013

I’m currently in Mt. Snow, Vermont to give a talk at the Gordon Research Conference on Computer Aided Drug Design. Yes, I know nothing about drug design. I am here because the organizer, Anthony Nicholls, asked me to give a pedagogical talk on Bayesian Inference. My slides are here. I only arrived yesterday but the few talks I’ve seen have been quite interesting. One interesting aspect of this conference is that many of the participants are from industry. The evening sessions are meant to be of more general interest. Last night were two talks about how to make science more reproducible. As I’ve posted before, many published results are simply wrong. The very enterprising Elizabeth Iorns has started something called the Reproducibility Initiative. I am not completely clear about how it works but it is part of another entity she started called Science Exchange, which helps to facilitate collaborations with a fee-for-service model. The Reproducibility Initiative piggy backs on Science Exchange by providing a service (for a fee) to validate any particular result. Papers that pass approval get a stamp of approval. It is expected that pharma would be interested in using this service so they can inexpensively check if possible drug targets actually hold up. Many drugs fail at phase three of clinical trials because they’ve been shown to be ineffective and this may be due to the target being wrong to start with.

On a final note, I flew to Albany and drove here. Unlike in the past when I would have printed out a map, I simply assumed that I could use Google Maps on my smart phone to get here. However, Google Maps doesn’t really know where Mt. Snow is. It tried to take me up a dirt road to the back of the ski resort. Also, just after I turned up the road, the phone signal disappeared so I was blind and had no paper backup. I was suspicious that this was the right way to go so I turned back to the main highway in hopes of finding a signal or a gas station to ask for directions. A few miles down Route 9, I finally did get a signal and also found a sign that led me the way. Google Maps still tried to take me the wrong way. I should have followed what I always tell my daughter – Always have a backup plan.

### Most of neuroscience is wrong

May 20, 2013

John Ioannidis has a recent paper in Nature Reviews Neuroscience arguing that many results in neuroscience are wrong. The argument follows his previous papers of why most published results are wrong (see here and here) but emphasizes the abundance of studies with small sample sizes in neuroscience. This both reduces the chances of finding true positives and increases the chances of obtaining false positives. Under powered studies are also susceptible to what is called the “winner’s curse” where the effect sizes of true positives are artificially amplified. My take is that any phenomenon with a small effect should be treated with caution even if it is real. If you really wanted to find what causes a given disease then you probably want to find something that is associated with all cases, not just in a small percentage of them.

### Bayesian model comparison Part 2

May 11, 2013

In a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the Bayes factor between two models. Here I will show one computational approach that I use called thermodynamic integration borrowed from molecular dynamics. Recall, that we need to compute the model likelihood function

$P(D|M)=\int P((D|M,\theta)P(\theta|M) d\theta$     (1)

for each model where $P(D|M,\theta)$ is just the parameter dependent likelihood function we used to find the posterior probabilities for the parameters of the model.

The integration over the parameters can be accomplished using the Markov Chain Monte Carlo, which I summarized previously here. We will start by defining the partition function

$Z(\beta) = \int P(D|M,\theta)^\beta P(\theta| M) d\theta$    (2)

where $\beta$ is an inverse temperature. The derivative of the log of the partition function gives

$\frac{d}{d\beta}\ln Z(\beta)=\frac{\int d\theta \ln[P(D |\theta,M)] P(D | \theta, M)^\beta P(\theta|M)}{\int d\theta \ P(D | \theta, M)^\beta P(\theta | M)}$    (3)

which is equal to the ensemble average of $\ln P(D|\theta,M)$. However, if we assume that the MCMC has reached stationarity then we can replace the ensemble average with a time average $\frac{1}{T}\sum_{i=1}^T \ln P(D|\theta, M)$.  Integrating (3) over $\beta$ from 0 to 1 gives

$\ln Z(1) = \ln Z(0) + \int \langle \ln P(D|M,\theta)\rangle d\beta$

From (1) and (2), we see that  $Z(1)=P(D|M)$, which is what we want to compute  and $Z(0)=\int P(\theta|M) d\theta=1$.

Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use $P(D|M,\theta)^\beta$ as the likelihood in the standard MCMC) and then integrate the log likelihoods $Z(1)$ over $\beta$ at the end. For a Gaussian likelihood function, changing temperature is equivalent to changing the data “error”. The higher the temperature the larger the presumed error. In practice, I usually run at seven to ten different values of $\beta$ and use a simple trapezoidal rule to integrate over $\beta$.  I can even do parameter inference and model comparison in the same MCMC run.

Erratum, 2013-5-2013,  I just fixed an error in the final formula

### New paper on finite size effects in spiking neural networks

January 25, 2013

Michael Buice and I have finally published our paper entitled “Dynamic finite size effects in spiking neural networks” in PLoS Computational Biology (link here). Finishing this paper seemed like a Sisyphean ordeal and it is only the first of a series of papers that we hope to eventually publish. This paper outlines a systematic perturbative formalism to compute fluctuations and correlations in a coupled network of a finite but large number of spiking neurons. The formalism borrows heavily from the kinetic theory of plasmas and statistical field theory and is similar to what we used in our previous work on the Kuramoto model (see here and  here) and the “Spike model” (see here).  Our heuristic paper on path integral methods is  here.  Some recent talks and summaries can be found here and here.

### Prediction requires data

November 8, 2012

Nate Silver has been hailed in the media as a vindicated genius for correctly predicting the election. He was savagely attacked before the election for predicting that Obama would win handily. Kudos also go to Sam Wang, Pollester.com, electoral-vote.com, and all others who simply took the data obtained from the polls seriously. Hence, the real credit should go to all of the polling organizations for collectively not being statistically biased.  It didn’t matter if single organizations were biased one way or the other as long as they were not correlated in their biases. The true power of prediction in this election was that the errors of the various pollsters were independently distributed. However, even if you didn’t take the data at face value, you could still reasonably predict the election. Obama had an inherent advantage because he had more paths to winning 270 electoral votes. Suppose there were 8 battleground states and Romney needed to win at least 6 of them. Hence, Romney had 28 ways to win while Obama had 228 ways to win. If the win probability was approximately a half in each of these states, which is what a lot of people claimed,  then Romney has slightly more than one in ten chance of winning, which is close to the odds given by Sam. The only way Romney’s odds would increase is if the state results were correlated in his favour. However, it would take a lot of correlated bias to predict that Romney was a favourite.

Erratum, Nov 9 2012:  Romney actually has 37 ways and Obama 219 in my example.  The total must add up to 2^8=256.  I forgot to include the fact that Romney could also win 7 of 8 states or all states in his paths to winning.

### Predicting the election

November 3, 2012

The US presidential election on Nov. 6 is expected to be particularly close. The polling has been vigorous and there are many statistical prediction web sites. One of them, the Princeton Election Consortium, is run by neuroscientist Sam Wang at Princeton University. For any non-American readers, the president is not elected directly by the citizens but through what is called the electoral college.  This is a set of 538 electoral voters that are selected by the individual states. The electoral votes are allotted according to the number of congressional districts per state plus two. Hence, low population states are over-represented. Almost all states agree that the candidate that takes the plurality of the votes in that state wins all the electoral votes of that state. Maine and Nebraska are the two exceptions that allot electoral votes according to who wins the congressional district. Thus in order to predict who will win, one must predict who will get at least 270 electoral votes. Most of the states are not competitive so the focus of the candidates (and media) are on a handful of so-called battleground states like Ohio and Colorado. Currently, Sam Wang predicts that President Obama will win the election with a median of 319 votes. Sam estimates the Bayesian probability for Obama’s re-election to be 99.6%. Nate Silver at another popular website (Five Thirty Eight), predicts that Obama will win 305 electoral votes and has a re-election probability of 83.7%.

These estimates are made by using polling data with a statistical model. Nate Silver uses national and state polls along with some economic indicators, although the precise model is unknown. Sam Wang uses only state polls. I’ll describe his method here. The goal is to estimate the probability distribution for the number of electoral votes a specific candidate will receive. The state space consists of $2^{51}$ possibilities (50 states plus the District of Columbia). I will assume that Maine and Nebraska do not split their votes along congressional districts although it is a simple task to include that possibility. Sam assumes that the individual states are statistically independent so that the joint probability distribution factorizes completely. He then takes the median of the polls for each state over some time window to represent the probability of that given state. The polling data is comprised of the voting preferences of a sample for a given candidate. The preferences are converted into probabilities using a normal distribution. He then computes the probability for all $2^{51}$ combinations. Suppose that there are just two states with win probabilities for your candidate of $p_1$ and $p_2$. The probability of your candidate winning both states is $p_1 p_2$, state 1 but not state 2 is $p_1(1-p_2)$, and so forth.  If the states have $EV_1$ and $EV_2$ electoral votes respectively then if they win both states they will win $EV_1+EV_2$ votes and so forth. To keep the bookkeeping simple, Sam uses the trick of expressing the probability distribution as a polynomial of a dummy variable $x$.  So the probability distribution is

$(p_1 x^{EV_1} + 1-p_1)(p_2 x^{EV_2} + 1-p_2)$

$= p_1 p_2 x^{EV_1+EV_2} + p_1(1-p_2) x^{EV_1} + (1-p_1)p_2 x^{EV_2} + (1-p_1)(1-p_2)$

Hence, the coefficient of each term is the probability for the number of electoral votes given by the exponent of $x.$The expression for 51 “states” is  $\prod_{i=1}^{51} (p_i x^{EV_i} + 1-p_i)$ and this can be evaluated quickly on a desktop computer. One can then take the median or mean of the distribution for the predicted  number of electoral votes. The sum of the probabilities for electoral votes greater than 269 gives the winning probability, although Sam uses a more sophisticated method for his predicted probabilities. The model does assume that the probabilities are independent.  Sam tries to account for this by using what he calls a meta-margin, in which he calculates how much the probabilities (in terms of preference) need to move for the leading candidate to lose. Also, the state polls will likely pick up any correlations as the election gets closer.

Most statistical models predict that Obama will be re-elected with fairly high probability but the national polls are showing that the race is almost tied. This discrepancy is a puzzle.  Silver’s hypothesis for why is here and Sam’s is here.  One of the sources for error in polls is that they must predict who will vote.  The 2008 election had a voter turnout of a little less than 62%. That means that an election can be easily won or lost based on turnout alone, which makes one wonder about democracy.

Nov 4: dead link is updated

### A new strategy for the iterated prisoner’s dilemma game

September 4, 2012

The game theory world was stunned recently when Bill Press and Freeman Dyson found a new strategy to the iterated prisoner’s dilemma (IPD) game. They show how you can extort an opponent such that the only way they can maximize their payoff is to give you an even higher payoff. The paper, published in PNAS (link here) with a commentary (link here), is so clever and brilliant that I thought it would be worthwhile to write a pedagogical summary for those that are unfamiliar with some of the methods and concepts they use. This paper shows how knowing a little bit of linear algebra can go a really long way to exploring deep ideas.

In the classic prisoner’s dilemma, two prisoner’s are interrogated separately. They have two choices. If they both stay silent (cooperate) they get each get a year in prison. If one confesses (defects) while the other stays silent then the defector is released while the cooperator gets 5 years.  If both defect then they both get 3 years in prison. Hence, even though the highest utility for both of them is to both cooperate, the only logical thing to do is to defect. You can watch this played out on the British television show Golden Balls (see example here). Usually the payout is expressed as a reward so if they both cooperate they both get 3 points, if one defects and the other cooperates then the defector gets 5 points and the cooperator gets zero,  and if they both defect they both get 1  point each. Thus, the combined reward is higher if they both cooperate but since they can’t trust their opponent it is only logical to defect and get at least 1 point.

The prisoner’s dilema changes if you play the game repeatedly because you can now adjust to your opponent and it is not immediately obvious what the best strategy is. Robert Axelrod brought the IPD to public attention when he organized a tournament three decades ago. The results are published in his 1984 book The Evolution of Cooperation.  I first learned about the results in Douglas Hofstader’s Metamagical Themas column in Scientific American in the early 1980s. Axelrod invited a number of game theorists to submit strategies to play IPD and the winner submitted by Anatol Rappaport was called tit-for-tat, where you always cooperate first and then do whatever your opponent does.  Since this was a cooperative strategy with retribution, people have been using this example of how cooperation could evolve ever since those results. Press and Dyson now show that you can win by being nasty. Details of the calculations are below the fold.

### What’s your likelihood?

August 10, 2012

Different internal Bayesian likelihood functions may be why we disagree. The recent shooting tragedies in Colorado and Wisconsin have set off a new round of arguments about gun control. Following the debate has made me realize that the reason the two sides can’t agree (and why different sides of almost every controversial issue can’t agree) is that their likelihood functions are completely different. The optimal way to make inferences about the world is to use Bayesian inference and there is some evidence that we are close to optimal in some circumstances. Nontechnically, Bayesian inference is a way to update  the strength of your belief in something (i.e. probability) given new data. What you do is to combine your prior probability with the likelihood of the data given your internal model of the issue (and then normalize to get a posterior probability). For a more technical treatment of Bayesian inference, see here. I posted previously (see here) that I thought that drastic differences in prior probabilities is why people don’t seem to update their beliefs when faced with overwhelming evidence to the contrary.  However, I’m starting to realize that the main reason might be that they have completely different models of the world, which in technical terms is their likelihood function.

Consider the issue of gun control.  The anti-gun control side argue that “guns don’t kill people, people kill people’ and that restricting access to guns won’t prevent determined malcontents from coming up with some means to kill. The pro-gun control side argues that the amount of gun violence is inversely proportional to the ease of access to guns. After all, you would be hard pressed to kill twenty people in a movie theatre with a sword. The difference in these two viewpoints can be summarized by their models of the world.  The anti-gun control people believe that the distribution of the will of people who would commit violence looks like this

where the horizontal line represents a level of gun restriction.  In this world view, no amount of gun restriction would prevent these people from undertaking their nefarious designs.  On the other hand, the pro-gun control side believes that the same distribution looks like this

in which case, the higher you set the barrier the fewer the number of crimes committed. Given these two views of the world, it is clear why new episodes of gun violence like the recent ones in Colorado and Wisconsin do not change people’s minds. What you would need to do is to teach a new likelihood function to the other side and that may take decades if at all.

### Log normal

May 25, 2012

A comment to my previous post correctly points out that the income distribution is approximately log-normal. What this means is that while income itself is not normally distributed, the logarithm of income is.  The log-normal distribution has a pretty fat tail for high incomes. A variable will be log-normal if it is the product of a lot of random variables, since the log of a product is a sum. It has been argued for many years that achievement should be log-normal because it involves the product of many independent events. This is why a good programmer can be hundreds of times better than a mediocre one.  I even gave a version of this argument here. Hence, small differences in innate ability can lead to potentially large differences in outcome. However, despite the fact that income may deviate from log-normality in some cases and in particular between sectors of the economy (e.g. finance vs. philosophy), there is still a question of whether the compensation scheme needs to follow log-normal even if productivity does. After all, if small differences in innate ability are magnified to such a large extent, one could argue that income should be pegged to the log of productivity.

### Nonlinearity in your wallet

May 25, 2012

Many human traits like height, IQ, and 50 metre dash times are very close to being normally distributed. The normal distribution (more technically the normal probability density function) or Gaussian function

$f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/2\sigma^2}$

is the famous bell shaped curve that the histogram of class grades fall on. The shape of the Gaussian is specified by two parameters the mean $\mu$, which coincides with the peak of the bell, and the standard deviation $\sigma$, which is a measure of how wide the Gaussian is. Let’s take height as an example. There is a 68% chance that any person will be within one standard deviation of the mean and a little more than 95% that you will be within two standard deviations. The tallest one percent are about 2.3 standard deviations from the mean.

The fact that lots of things are normally distributed  is not an accident but a consequence of the central limit theorem (CLT), which may be the most important mathematical law in your life. The theorem says that the probability distribution of a sum of a large number of random things will be normal (i.e. a Gaussian). In the example of height, it suggests that there are perhaps hundreds or thousands of genetic and environmental factors that determine your height, each contributing a little amount. When you add them together you get your height and the distribution is normal.

Now, the one major thing in your life that bucks the normal trend is income and especially wealth distribution. Incomes are extremely non-normal. They have what are called fat tails, meaning that the income of the top earners are much higher than would be expected by a normal distribution. A general rule of thumb called the Pareto Principle is that 20% of the population controls 80% of the wealth. It may even be more skewed these days.

There are many theories as to why income and wealth is distributed the way it is and I won’t go into any of these. What I want to point out is that whatever it is that governs income and wealth, it is definitely nonlinear. The key ingredient in the CLT is that the factors add linearly. If there were some nonlinear combination of the variables then the result need not be normal. It has been argued that some amount of inequality is unavoidable given that we are born with unequal innate traits but the translation of those differences into  income inequality is a social choice to some degree. If we rewarded the contributors to income more linearly, then incomes would be distributed more normally (there would be some inherent skew because incomes must be positive). In some sense, the fact that some sectors of the economy seem to have much higher incomes than other sectors implies a market failure.

### Known unknown unknowns

April 20, 2012

I  listened today to the podcast of science fiction writer Bruce Sterling’s Long Now Foundation talk from 2004 on “The Singularity: Your Future as a Black Hole”.  The talk is available here.  Sterling describes some of the ideas of  mathematician and science fiction writer Vernor Vinge’s conception of the singularity as a scary moment in time where super human intelligence ends the human era and we have no way to predict what will happen.  I won’t address the issue of whether or not such a moment in time will or not happen in the near future or ever.  I’ve posted about it in the past (e.g. see here, here and here).  What I do want to discuss is whether or not there can exist events or phenomena that are so incomprehensible that it will reduce us to a quivering pile of mush.  I think an excellent starting point is former US Secretary of Defense, Donald Rumsfeld’s infamous speech from 2002 regarding the link between Iraq and weapons of mass destruction prior to the Iraq war, where he said:

[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – there are things we do not know we don’t know.

While Rumsfeld was mocked by the popular media for this seemingly inane statement, I actually think (geopolitical consequences aside) that it was the deepest thing I ever heard him say. Rumsfeld is a Bayesian! There is a very important distinction between known unknowns and unknown unknowns.  In the first case, we can assign a probability to the event.  In the second we cannot.  Stock prices are known unknowns, while black swans are unknown unknowns.  (Rumsfeld’s statement predates Nassim Taleb’s book.)  My question is not whether we can predict black swans (by definition we cannot) but whether something can ever occur that we wouldn’t even be able to describe it much less understand it.

In Bayesian language, a known unknown would be any event for which a prior probability exists.    Unknown unknowns are events for which there is no prior.  It’s not just that the prior is zero, but also that it is not included in our collection (i.e. sigma algebra) of possibilities. Now, the space of all possible things is uncountably infinite so it seems likely that there will be things that you cannot imagine. However, I claim that by simply acknowledging that there exist things that I cannot possible ever imagine, is sufficient to remove the surprise. We’ve witnessed enough in the post-modern world, to assign a nonzero prior to the possibility that anything can and will happen. That is not to say that we won’t be very upset or disturbed by some event. We may read about some horrific act of cruelty tomorrow that will greatly perturb us but it won’t be inconceivably shocking.  The entire world could vanish tomorrow and be replaced by an oversized immersion blender and while I wouldn’t be very happy about it and would be extremely puzzled by how it happened, I would not say that it was impossible. Perhaps I won’t be able to predict what will happen after the singularity arrives but I won’t be surprised by it.

### Talk at MBI

February 23, 2012

I’m currently about to give a talk at a workshop  on statistical inference in biology at the Mathematical Biosciences Institute at Ohio State.  My talk is a variation of previous ones on using Bayesian methods for parameter estimation and model comparison.  The slides are here.

### New paper in Biophysical Journal

February 14, 2012

Bayesian Functional Integral Method for Inferring Continuous Data from Discrete Measurements

Biophysical Journal, Volume 102, Issue 3, 399-406, 8 February 2012

doi:10.1016/j.bpj.2011.12.046

William J. Heuett, Bernard V. Miller, Susan B. Racette, John O. Holloszy, Carson C. Chow, and Vipul Periwal

Abstract: Inference of the insulin secretion rate (ISR) from C-peptide measurements as a quantification of pancreatic β-cell function is clinically important in diseases related to reduced insulin sensitivity and insulin action. ISR derived from C-peptide concentration is an example of nonparametric Bayesian model selection where a proposed ISR time-course is considered to be a “model”. An inferred value of inaccessible continuous variables from discrete observable data is often problematic in biology and medicine, because it is a priori unclear how robust the inference is to the deletion of data points, and a closely related question, how much smoothness or continuity the data actually support. Predictions weighted by the posterior distribution can be cast as functional integrals as used in statistical field theory. Functional integrals are generally difficult to evaluate, especially for nonanalytic constraints such as positivity of the estimated parameters. We propose a computationally tractable method that uses the exact solution of an associated likelihood function as a prior probability distribution for a Markov-chain Monte Carlo evaluation of the posterior for the full model. As a concrete application of our method, we calculate the ISR from actual clinical C-peptide measurements in human subjects with varying degrees of insulin sensitivity. Our method demonstrates the feasibility of functional integral Bayesian model selection as a practical method for such data-driven inference, allowing the data to determine the smoothing timescale and the width of the prior probability distribution on the space of models. In particular, our model comparison method determines the discrete time-step for interpolation of the unobservable continuous variable that is supported by the data. Attempts to go to finer discrete time-steps lead to less likely models.

### St. Petersburg Paradox

November 13, 2011

The St. Petersburg Paradox is a problem in economics that was first proposed by Nicolas Bernoulli in 1713 in a letter.  It involves a lottery where you buy a ticket to play a game where a coin is flipped until heads comes up.  If heads comes up on the $n$th toss you get $2^{n-1}$ dollars.  So if heads comes up on the first toss you get one dollar and if it comes up on the fourth you would get 8 dollars.  The question is how much would you pay for a ticket to play this game.  In economics theory, the idea is that you would play if the expectation value of  the payout minus the ticket price is positive. The paradox for this game is that the expectation value of the payout is infinite but most people would pay no more than ten dollars.  The solution to the paradox  has been debated for the past three centuries.  Now, physicist Ole Peters argues that everyone before has missed a crucial  point and provides a new resolution to the paradox.  Peters also shows that a famous paper by Karl Menger in 1934 about this problem contains two critical errors that nullify Menger’s results.  I’ll give a summary of the mathematical analysis below, including my even simpler resolution.

The reason the expectation value of the payout is infinite is that the distribution is not normalizable. This can be seen easily because while the probability of getting $n$ heads in a row decreases exponentially as $p(n)=(1/2)^n$, the payout increases exponentially as $S(n)=2^{n-1}$. The product is always 1/2 and never decays.  The expectation value is thus

$E[S]=\sum_{n=1}^\infty p(n)S(n) = 1/2+1/2 + \cdots$

and diverges.  The first proposed resolution of the paradox was by Daniel Bernoulli in a 1738  paper submitted to the Commentaries of the Imperial Academy of Science of St. Petersburg, from which the paradox received its name.  Bernoulli’s suggestion was that people don’t really value money linearly and proposed a utility function $U(S) = \log S$, so the utility of money decreases with wealth. Given that this now grows sub-exponentially, the expectation value of $U(S)$ is thus finite and resolves the paradox.  People have always puzzled over this solution because it seems ad hoc.  Why should my utility function be the same as someone else’s?  Menger suggested that he could always come up with an even faster growing payout function to make the expectation value still divergent and declared that all utility functions must be bounded.  According to Peters, this has affected the course of economics for the twentieth century and may have led to more risk taking than warranted mathematically.

Peter’s resolution is that the expectation value of the Bernoulli utility function is actually the time average of  the growth rate in wealth of a person that plays repeatedly.  Hence, if they pay more than a certain price they would certainly become bankrupt.  The proof is quite simple.  The factor by which a person’s wealth at round $i$ changes is given by the expression

$r_i = \frac{W_i-C+S_i}{W_i}$

where $W_i$ is the wealth, $C$ is the cost to play, and $S_i$ is the payout at round $i$.  The total fractional change after $T$ rounds is thus $\bar{r}_T=(\prod_{i=1}^T r_i)^{1/T}$.  Now transform from rounds of the game played into $n$ the number of tosses until the first heads.  This brings in the number of the occurrence of $n$, $k_n$ to yield

$\bar{r}_T= \prod_{n=1}^{n_{\infty}} r_n^{k_n/T}=\prod_{n=1}^{n_{\infty}} r_n^{p_n}$,

where $p_n$ is the probability for $n$ tosses.  The average growth rate is given by taking the log, which gives the expression

$\sum_{n=1}^\infty \left(\frac{1}{2}\right)^n (\ln(W-c+2^{n-1})-\ln W)$

which is equivalent to the Bernoulli solution without the need for a utility function.

Now my solution, which has probably been proposed previously, is that we don’t really evaluate the expectation value of the payout but we take the payout of the expected number of tosses, which is a finite amount.  Thus we replace $E(S(n))$ with $S(E(n))$, where

$E(n) =\sum_{n=1}^\infty n \left(\frac{1}{2}\right)^n=2$,

which means we wouldn’t really want to play for more than 2 dollars. This might be a little conservative but it’s what I would do.

### Evolution of overconfidence

July 31, 2011

A new paper on the evolution of overconfidence (arXiv:0909.4043v2) will appear shortly in Nature. (Hat tip to J.L. Delatre). It is well known in psychology that people generally overvalue themselves and it has always been a puzzle as to why.  This paper argues that under certain plausible conditions, it may have been evolutionarily advantageous to be overconfident.  One of the authors is James Fowler who has garnered recent fame for claiming with Nicholas Christakis that medically noninfectious phenomena such as obesity and divorce are socially contagious.  I have always been skeptical of these social network results and it seems like  there has been some recent push back.  Statistician and blogger Andrew Gelman has a summary of the critiques here.  The problem with these papers  fall in line with the same problems of many other clinical papers that I have posted on before (e.g. see here and here).  The evolution of overconfidence paper does not rely on statistics but on a simple evolutionary model.

The model  considers competition between two parties for some scarce resource.  Each party possess some heritable attribute and the one with the higher value of that attribute will win a contest and obtain the resource.   The model allows for three outcomes in any interaction: 1) winning a competition and obtaining the resource with value W-C (where C is the cost of competing), 2) claiming the resource without a fight with value W, and 3) losing a competition with a value -C.    The parties assess their own and their opponents attributes before deciding to compete.  If both parties had perfect information, participating in a contest would be unnecessary.  Both parties would realize who would win and the stronger of the two would claim the prize. However,  because of the error and biases in assessing attributes, resources will be contested. Overconfidence is represented as a positive bias in assessing oneself.  The authors chose a model that was simple enough to explicitly evaluate the outcomes of all possible situations and show that when the reward for winning is sufficiently large compared to the cost, then overconfidence is evolutionarily stable.

Here I will present a simpler toy model of why the result is plausible. Let P be the probability that a given party will win a competition on average and let Q be the probability that they will engage in a competition. Hence, Q is a measure of overconfidence.  Using these values, we can then compute the expectation value of an interaction:

$E = Q^2P (W-C) + Q(1-Q) W - Q^2(1-P) C$

(i.e. the probability of a competition and winning is $Q^2P$, the probability of  winning and not having to fight is $Q(1-Q)$, the probability of  losing a competition is $Q^2(1-P)$, and it doesn’t cost anything to not compete.)  The derivative of E with respect to Q is

$E' = 2 QP(W-C) + (1-2Q)W-2Q(1-P)C=2Q[(1-P)W-C]+W$

Hence, we see that if $(1-P)W > C$, i.e. the reward of winning sufficiently exceeds the cost of competing, then the expectation value is guaranteed to increase with increasing confidence. Of course this simple demonstration doesn’t prove that overconfidence is a stable strategy but it does affirm Woody Allen’s observation that “95% of life is just showing up.”

### Judicial system versus Bayesian brain

July 11, 2011

I think the recent uproar over the acquittal of  Casey Anthony clearly shows how our internal system of inference can be at odds with the American judicial system.  For those of you who don’t pay attention to the mainstream media, Casey Anthony was a young mother of a toddler that was found dead.  What captivated the American public was that the toddler had been missing for a month before Anthony reported it.  She lied to her parents and the authorities about the whereabouts of the child and even appeared celebratory in public during the period of the child’s disappearance.  In the court of public opinion, Anthony was clearly guilty.  The fact that a mother showed no anxiety whatsoever over the disappearance of a child clearly indicates that she was the culprit.

The American judicial system requires that if there is any reasonable doubt of guilt then a person must be acquitted.  The burden of proof is on the prosecution.  In this case, there were no witnesses and no physical evidence linking Anthony to the death of the child or that the child was even murdered.  Thus there was the remote possibility that Anthony was not responsible for the child’s death but simply took advantage of the situation, as macabre as that may be.  Even though the probability  of these two unlikely events – a mother wishing to be free of her child and a child going missing – is exceedingly low, it is still non zero and thus the jury was forced to acquit.

Now if a single piece of evidence had linked Anthony to the crime, say a fingerprint or DNA sample,  then she most likely would have been found guilty.  The interesting aspect of this is that there is also an equally low probability that someone could have planted the evidence to frame her. Thus, reasonable doubt is not a global quantity according to the law. It is not sufficient that the total probability that the accused is guilty be high, it also matters if it is high in each of several categories, i.e. motive, opportunity, and direct physical evidence or witnesses.  Circumstantial evidence is insufficient to convict a criminal.  However, it appears that our brains do not work this way.  We seem to take the global probability of guilt and go with that.

### Stochastic differential equations

June 5, 2011

One of the things I noticed at the recent Snowbird meeting was an increase in interest in stochastic differential equations (SDEs) or Langevin equations.  They arise wherever noise is involved in a dynamical process.  In some instances, an SDE comes about as the continuum approximation of a discrete stochastic process, like the price of a stock.  In other cases, they arise as a way to reintroduce stochastic effects to mean field differential equations originally obtained by averaging over a large number of stochastic molecules or neurons.  For example, the Hodgkin-Huxley equation describing action potential generation in neurons is a mean field approximation of the stochastic transport of ions (through ion channels) across the cell membrane, which can be modeled as a multi-state Markov process usually simulated with the Gillespie algorithm (for example see here).  This is computationally expensive so adding a noise term to the mean field equations is a more efficient way to account for stochasticity.

### Distribution cheat sheet

February 10, 2011

I can never remember the form of distributions so here is a cheat sheet for the density functions of some commonly used ones.

Binomial: $\theta \sim {\rm Bin}(n,p)$

$\displaystyle p(\theta) = \left( \begin{array}{c} n \\ \theta \end{array}\right)p^\theta(1-p)^{n-\theta}$

Multinomial: $\theta \sim {\rm Multin}(n;p_1,\dots,p_k), \theta_j = 0,1,2,\dots n$

$\displaystyle p(\theta) = \left( \begin{array}{c} n \\ \theta_1\theta_2\cdots\theta_k \end{array}\right)p^\theta_1\cdots p^\theta_k, \quad \sum_{j=1}^k\theta_j = n$

Negative binomial: $\theta \sim \mbox{Neg-bin}(\alpha,\beta)$

$\displaystyle p(\theta)= \left( \begin{array}{c} \theta+\alpha-1 \\ \alpha -1 \end{array}\right)\left(\frac{\beta}{\beta+1}\right)^\alpha \left(\frac{1}{\beta+1}\right)^\theta$

Poisson: $\theta \sim {\rm Poisson}(\lambda)$

$\displaystyle p(\theta)=\frac{\lambda^\theta e^{-\lambda}}{\theta!}$

Normal: $\theta \sim {\rm N}(\mu,\sigma^2)$

$\displaystyle p(\theta)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\frac{-(\theta-\mu)^2}{2\sigma^2}}$

Multivariate normal: $\theta \sim {\rm N}(\mu,\Sigma), \theta \in R^d$

$\displaystyle p(\theta)=(2\pi)^{-d/2}|\Sigma|^{-1/2}\exp{\left(-\frac{1}{2}(\theta-\mu)^T\Sigma^{-1}(\theta-\mu)\right)}$

Gamma: $\theta \sim {\rm Gamma}(\alpha,\beta)$

$\displaystyle p(\theta) = \frac{ \beta^\alpha\theta^{\alpha-1} e^{-\beta\theta}}{\Gamma(\alpha)}$

Inverse gamma: $\theta \sim \mbox{Inv-gamma} (\alpha,\beta)$

$\displaystyle p(\theta) = \frac{ \beta^{\alpha} \theta^{-(\alpha +1)} e^{-\beta/\theta}}{\Gamma(\alpha)}$

Beta: $\theta \sim {\rm Beta}(\alpha,\beta)$

$\displaystyle p(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha -1}(1-\theta)^{\beta -1}$

Student t: $\theta \sim t_\nu(\mu,\sigma^2)$

$p(\theta)=\frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi\sigma}}\left(1+\frac{1}{\nu}\left(\frac{\theta-\mu}{\sigma}\right)^2\right)^{-(\nu+1)/2}$

Cauchy: $\theta \sim t_1(\mu,\sigma^2)$

Chi-square: $\theta \sim \chi_\nu^2 = {\rm Gamma}(\nu/2,1/2)$