Chat GPT and the end of human culture

I had a personal situation this past year that kept me from posting much but today I decided to sit down and write something – all by myself without any help from anyone or anything. I could have enlisted the help of Chat GPT or some other large language model (LLM) but I didn't. These posts generally start out with a foggy idea, which then takes on a life of its own. Part of my enjoyment of writing these things is that I really don't know what they will say until I'm finished. But sometime in the near future I'm pretty sure that WordPress will have a little window where you can type an idea and an LLM will just write the post for you. At first I will resist using it but one day I might not feel well, and I'll try it, and like it, and eventually all my posts will be created by a generative AI. Soon afterwards, the AI will learn what I like to blog about and how often I do so and it will just start posting on its own without my input. Maybe most or all content will be generated by an AI.

These LLMs are created by training a neural network to predict the next word of a sentence, given the previous words, sentences, paragraphs, and essentially everything that has ever been written. The machine is fed some text and produces what it thinks should come next. It then compares its prediction with the actual answer and updates its settings (connection weights) based on some score of how well it did. When the network is fed the entire corpus of human knowledge (or at least what is online), we have all seen how well it can do. As I have speculated previously (see here), this isn't all that surprising given that the written word is relatively new in our evolutionary history. Thus, humans aren't really all that good at it and there isn't all that much variety in what we write. Once an AI has the ability to predict the next word, it doesn't take much more tinkering to make it generate an entire text: you just sample a predicted next word, append it to the text, and repeat. Image generators made a similar leap with a different technology called a diffusion model, which I may describe in more technical detail in the future. In the simplest terms, a diffusion model finds successive small modifications that transform an initial image (or other data) into pure noise; the model can then be run backwards, starting from random noise, to create new samples.
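To make that generate-by-repeated-prediction loop concrete, here is a toy Julia sketch. The lookup table of next-word probabilities is completely made up and just stands in for a trained network.

```julia
# Toy autoregressive generation: repeatedly sample the next word from a predicted
# distribution and feed it back in. The "model" is a made-up lookup table.
next_word_probs = Dict(
    "the" => [("cat", 0.5), ("dog", 0.3), ("end", 0.2)],
    "cat" => [("sat", 0.6), ("ran", 0.2), ("end", 0.2)],
    "dog" => [("ran", 0.7), ("sat", 0.1), ("end", 0.2)],
    "sat" => [("end", 1.0)],
    "ran" => [("end", 1.0)],
)

function sample_next(word)
    r, c = rand(), 0.0
    for (w, p) in next_word_probs[word]
        c += p
        r <= c && return w       # sample according to the listed probabilities
    end
    return "end"
end

text = ["the"]
while text[end] != "end"
    push!(text, sample_next(text[end]))
end
println(join(text, " "))
```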

When all content is generated by AI, the AI will no longer have any human data on which to further train. Human written culture will then be frozen. The written word will just consist of rehashings of previous thoughts along with random insertions generated by a machine. If the AI starts to train on AI-generated text then it could leave human culture entirely. Generally, when these statistical learning machines train on their own generated data they can go unstable and become completely unpredictable. Will the AI be considered conscious by then?

The probability of extraterrestrial life

Since the discovery of exoplanets nearly three decades ago, most astronomers, at least the public-facing ones, seem to agree that it is just a matter of time before we find signs of life, such as the presence in an atmosphere of volatile gases associated with life, like methane or oxygen. I'm an agnostic on the existence of life outside of Earth because we don't have any clue as to how easy or hard it is for life to form. To me, it is equally possible that the visible universe is teeming with life or that we are alone. We simply do not know.

But what would happen if we found life on another planet? How would that change our expected probability for life in the universe? MIT astronomer Sara Seager once made an offhand remark in a podcast that finding another planet with life would make it very likely there were many more. But is this true? Does the existence of another planet with life mean a dramatic increase in the probability of life in the universe? We can find out by doing the calculation.

Suppose you believe that the probability of life on a planet is f (i.e. the fraction of planets with life) and that this probability is uniform across the universe. Then if you search n planets, the number of planets with life that you will find is given by a Binomial distribution. The probability of finding x planets with life is given by the expression P(x | f) = C(n,x) f^x(1-f)^{n-x}, where C(n,x) is the binomial coefficient, which ensures that the probabilities sum to one over x from 0 to n. By Bayes Theorem, the posterior probability for f (yes, that would be the probability of a probability) is given by

P(f | x) = \frac{ P(x | f) P(f)}{P(x)}

where P(x) = \int_0^1 P(x | f) P(f)  df. As expected, the posterior depends strongly on the prior. A convenient way to express the prior probability is to use a Beta distribution

P(f |\alpha, \beta) = B(\alpha,\beta)^{-1} f^{\alpha-1} (1-f)^{\beta-1} (*)

where B is again a normalization constant (the Beta function). The mean of a beta distribution is given by E(f) = \alpha/(\alpha + \beta) and the variance, which is a measure of uncertainty, is given by Var(f) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}. The posterior distribution for f after observing x planets with life out of n will be

P(f | x) = D f^{\alpha + x -1} (1-f)^{n+\beta - x -1}

where D is a normalization factor. This is again a Beta distribution. The Beta distribution is called the conjugate prior for the Binomial because its form is preserved in the posterior.

Applying Bayes theorem with the prior (*), we see that the mean and variance of the posterior become (\alpha+x)/(\alpha + \beta + n) and \frac{(\alpha+x)(\beta+n-x)}{(\alpha + \beta + n)^2 (\alpha + \beta + n + 1)}, respectively. Now let's consider how our priors get updated. Suppose our prior was \alpha = \beta = 1, which gives a uniform distribution for f on the range 0 to 1. It has a mean of 1/2 and a variance of 1/12. If we find one planet with life after checking 10,000 planets then our expected f becomes 2/10002 with variance 2\times 10^{-8}. The observation of a single planet has greatly reduced our uncertainty and we now expect about 1 in 5000 planets to have life. Now what happens if we find no planets? Then our expected f only drops to about 1 in 10000 and the variance is about the same. So, with an unbiased prior, the difference between finding a planet and not finding one only changes our posterior mean by a factor of two. But suppose we are really skeptical and have a prior with \alpha = 0 and \beta = 1, so our expected probability is zero with zero variance. The observation of a single planet increases our posterior mean to 1 in 10001 with about the same small variance. However, if we find a single planet out of many fewer observations, say 100, then our expected probability for life would be even higher but with more uncertainty. In any case, Sara Seager's intuition is correct – finding a planet would be a game changer and not finding one shouldn't really discourage us that much.
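For concreteness, here is a short Julia snippet that reproduces the numbers above from the Beta-Binomial update (conjugacy means the posterior is just Beta(\alpha + x, \beta + n - x)):

```julia
# Posterior mean and variance of a Beta(alpha, beta) prior after seeing x successes in n trials.
post_mean(a, b, x, n) = (a + x) / (a + b + n)
post_var(a, b, x, n)  = (a + x) * (b + n - x) / ((a + b + n)^2 * (a + b + n + 1))

post_mean(1, 1, 1, 10_000)   # ≈ 2.0e-4, i.e. about 1 in 5000
post_var(1, 1, 1, 10_000)    # ≈ 2.0e-8
post_mean(1, 1, 0, 10_000)   # ≈ 1.0e-4, i.e. about 1 in 10000 if no planet is found
post_mean(0, 1, 1, 10_000)   # ≈ 1.0e-4 for the skeptical prior
```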

Exponential growth

The spread of covid-19 and epidemics in general is all about exponential growth. Although I deal with exponential growth in my work daily, I don't have an immediate intuitive grasp of what it means in real life until I actually think about it. I just write down functions of the form e^{rt} and treat them as abstract quantities. In fact, I often just look at differential equations of the form

\frac{dx}{dt} = r x

for which x = A e^{rt} is a solution, where A is any constant, because the derivative of an exponential function is an exponential function. One confusing aspect of discussing exponential growth with lay people is that the above equation is called a linear differential equation, so mathematicians and physicists will often simply say that the dynamics are linear or even that the growth is linear, although what we really mean is that the describing equation is a linear differential equation, whose solutions grow exponentially, or decay exponentially if r is negative.

If we let t represent time (say in days) then r is a growth rate. What x = e^{rt} actually means (taking r = 1 per day for simplicity) is that every day x, e.g. the number of people with covid-19, increases by a factor of e, which is about 2.718. Now, while this is the most convenient number for mathematical manipulation, it is definitely not the most convenient number for intuition. A common alternative is to use 2 as the base of the exponent rather than e. This then means that tomorrow you will have double what you have today. I think 2 doesn't really convey how fast exponential growth is, because you say, well, I start with 1 today, tomorrow I have 2, the day after 4, and, like the famous Simpsons episode in which Homer, making sure that Bart is a name safe from mockery, goes through the alphabet saying "Aart, Bart, Cart, Dart, Eart, okay Bart is fine," you have to go all the way to day 10 before you get a thousand-fold increase (1024 to be precise), which still doesn't seem too bad. Even on day 30 you have 2 to the power of 30, or 2^{30}, which is only a billion or so. But on day 31 you have 2 billion and on day 32 you have 4 billion. It's growing really fast.

Now, I think exponential growth really hits home if you use 10 as the base, because then you simply add a zero every day: 1, 10, 100, 1000, 10000, 100000. Imagine if your stock were increasing by a factor of 10 every day. Starting from a dollar you would have a million dollars by day 6 and a billion by day 9, be richer than Bill Gates by day 11, and have more money than the entire world by the third week. Now, the only difference between exponential growth with base 2 versus base 10 is the rate of growth. This is where logarithms are very useful, because they are the inverse of an exponential. So, if x = 2^3 then \log_2 x = 3 (i.e. the log of x in base 2 is 3). The log just gives you the power of an exponential (in that base). For example, 10^x = 2^{(\log_2 10) x}, where \log_2 10 \approx 3.3. So an exponential process that is doubling every day is increasing by a factor of ten every 3.3 days.
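A quick sanity check of that conversion in Julia:

```julia
# Doubling every day is the same as growing by a factor of ten every log2(10) days.
log2(10)      # ≈ 3.32 days per factor of ten
2^log2(10)    # 10.0, up to floating point error
```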

The thing about exponential growth is that most of the action is in the last few days. This is probably the hardest part to comprehend. The reason they were putting hospital beds in the convention center in New York weeks ago is that if the number of covid-19 cases is doubling every 5 days, then even if you are 100-fold under capacity today, you will be over capacity after 7 doublings, or 35 days, and within another week you will have twice as many cases as you can handle.

Flattening the curve means slowing the growth rate. If you can cut the rate of growth in half then you have 70 days before you reach 7 doublings and max out capacity. If you can make the growth rate negative then the number of cases will decrease and the epidemic will flame out. There are two ways to make the growth rate go negative. The first is to use external measures like social distancing and the second is to reach nonlinear saturation, which I'll discuss in a later post. This is also why social distancing measures seem so extreme and unnecessary: you need to apply them before there is a problem, and if they work then those beds in the convention center never get used. It's a no-win situation; if the measures work they will seem like an overreaction, and if they don't then hospitals will be overwhelmed. Given that 7 billion lives and livelihoods are at stake, it is not a decision to be taken lightly.
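Here is the capacity arithmetic above as a one-line Julia function; the 100-fold headroom and the 5- and 10-day doubling times are the numbers used in the last two paragraphs.

```julia
# Days until cases exceed hospital capacity, starting headroom-fold under capacity.
days_to_capacity(doubling_time; headroom = 100) = doubling_time * log2(headroom)
days_to_capacity(5)    # ≈ 33 days (about 7 doublings)
days_to_capacity(10)   # ≈ 66 days if the growth rate is cut in half
```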


New paper on GWAS

Genetic Epidemiology. 2018 Dec;42(8):783-795. doi: 10.1002/gepi.22161. Epub 2018 Sep 24.

The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies.

Author affiliations:

1. Department of Psychology, University of Minnesota Twin Cities, Minneapolis, Minnesota.
2. Mathematical Biology Section, Laboratory of Biological Modeling, NIDDK, National Institutes of Health, Bethesda, Maryland.

Abstract

To infer that a single-nucleotide polymorphism (SNP) either affects a phenotype or is in linkage disequilibrium with a causal site, we must have some assurance that any SNP-phenotype correlation is not the result of confounding with environmental variables that also affect the trait. In this study, we study the properties of linkage disequilibrium (LD) Score regression, a recently developed method for using summary statistics from genome-wide association studies to ensure that confounding does not inflate the number of false positives. We do not treat the effects of genetic variation as a random variable and thus are able to obtain results about the unbiasedness of this method. We demonstrate that LD Score regression can produce estimates of confounding at null SNPs that are unbiased or conservative under fairly general conditions. This robustness holds in the case of the parent genotype affecting the offspring phenotype through some environmental mechanism, despite the resulting correlation over SNPs between LD Scores and the degree of confounding. Additionally, we demonstrate that LD Score regression can produce reasonably robust estimates of the genetic correlation, even when its estimates of the genetic covariance and the two univariate heritabilities are substantially biased.

Keywords: causal inference; genetic correlation; heritability; population stratification; quantitative genetics

PMID: 30251275  DOI: 10.1002/gepi.22161

The low carb war continues

Last month, a paper in BMJ (formerly the British Medical Journal) on the effect of low carb diets on energy expenditure, with senior author David Ludwig, made a big splash in the popular press and also instigated a mini-Twitter war. The study, which cost somewhere in the neighborhood of 12 million dollars, addressed the general question of whether a person will burn more energy on a low carbohydrate diet compared to an average or high carb diet. In particular, the study looked at the time period after weight loss, when people are susceptible to regaining weight. The argument is that it will be easier to maintain weight loss on a low carb diet since you will be burning more energy. Recent intensive studies by my colleague Kevin Hall and others have found that low carb diets have little if any effect on energy expenditure, so this paper was somewhat of a surprise and gave hope to low carb aficionados. However, Kevin found some possible flaws, which he points out in an official response to BMJ and a bioRxiv paper, which then prompted a none-too-pleased response from Ludwig, which you can follow on Twitter. The bottom line is that the low carb effect size depends on the baseline point you compare to. In the original study plan, the baseline point was chosen to be energy expenditure prior to the weight loss phase of the study. In the publication, the baseline point was changed to after the weight loss but before the weight loss maintenance phase. If the original baseline is used, the low carb effect is no longer significant. The authors claim that they were blinded to the data and changed the baseline for technical reasons, so this did not represent a case of p-hacking, where one tries multiple combinations until something significant turns up. It seems pretty clear to me that low carb diets do not have much of a metabolic effect, but that is not to say that they are not effective. The elephant in the room is still appetite. It is possible that you are simply less hungry on a low carb diet and thus you eat less. Also, when you eliminate a whole category of food, there is just less food to eat. That could be the biggest effect of all.

The tragedy of low probability events

We live in an age of fear and yet life (in the US at least) is the safest it has ever been. Megan McArdle blames coddling parents and the media in a Washington Post column. She argues that cars and swimming pools are much more dangerous than school shootings and kidnappings, yet we mostly ignore the former and obsess about the latter. However, to me dying from an improbable event is just so much more tragic than dying from an expected one. I would be much less despondent meeting St. Peter at the Pearly Gates if I happened to expire from cancer or heart disease than if I were to be hit by an asteroid while weeding my garden. We are so scared now because we have never been safer. We would fear terrorist attacks less if they were more frequent. This is the reason that I would never want a major increase in lifespan. I most certainly would like to last long enough to see my children become independent but anything beyond that is bonus time. Nothing could be worse to me than immortality. The pain of any tragedy would be unbearable. Life would consist of an endless accumulation of sad memories. The way out is to forget but that to me is no different from death. What would be the point of living forever if you were to erase much of it? What would a life be if you forgot the people and things that you loved? To me that is no life at all.

Sampling from a probability distribution

When simulating any system with randomness, sampling from a probability distribution is necessary. Usually, you'll just need to sample from a normal or uniform distribution and can thus use a built-in random number generator. However, for the times when a built-in function does not exist for your distribution, here's a simple algorithm.

Let's say you have some probability density function (PDF) \rho(x) on some domain [x_{\min},x_{\max}] and you want to generate a set of numbers that follows this probability law. This is actually very simple to do although those new to the field may not know how. Generally, any programming language or environment has a built-in random number generator for real numbers (up to machine precision) between 0 and 1. Thus, what you want to do is to find a way to transform these random numbers, which are uniformly distributed on the domain [0,1], into the random numbers you want. The first thing to notice is that the cumulative distribution function (CDF) for your PDF, P(x) = \int_{x_{\min}}^x \rho(x') dx', is a function that ranges over the interval [0,1], since it is a probability. Thus, the algorithm is: 1) generate a random number r between 0 and 1, and 2) plug it into the inverse of P(x). The numbers you get out satisfy your distribution (i.e. P(x)=r \Rightarrow x = P^{-1}(r) ).
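For example, if \rho(x) = e^{-x} on [0,\infty), then P(x) = 1 - e^{-x} and P^{-1}(r) = -\ln(1-r), so in Julia:

```julia
r = rand(10_000)       # uniform random numbers on [0, 1]
x = -log.(1 .- r)      # samples from the exponential density rho(x) = exp(-x)
```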

If you're lucky enough to have a closed form expression for the inverse of the CDF, P^{-1}(r), then you can just plug the random numbers directly into your function; otherwise you have to do a little more work. If you have a closed form expression for the CDF but not for its inverse, then you can solve the equation P(x)-r=0 numerically. Newton's method will generally converge very quickly: you just loop over the update x = x - (P(x)-r)/\rho(x) until you reach the tolerance you want. If you don't have a closed form expression for the CDF at all, then the simplest way is to construct the CDF numerically by computing the N dimensional vector v[j]=\sum_{i=0}^j \rho(ih+x_{\min}) h, for some discretization parameter h such that x_{\max}= Nh+x_{\min}. If the PDF is defined on the entire real line then you'll have to decide what you want the minimum x_{\min} and maximum (through N) x_{\max} to be. If the distribution has thin tails, then you just need to go out until the probability is smaller than your desired error tolerance. You then find the index j such that v[j] is closest to r (i.e. in Julia you could use findmin(abs.(v .- r))) and your random number is x = hj+x_{\min}.
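Here is a minimal Julia sketch of the numerical-CDF version; the density, domain, and grid size are arbitrary choices for illustration.

```julia
# Inverse transform sampling via a numerically constructed CDF.
# Example density (assumed for illustration): an exponential truncated to [0, 5].
rho(x) = exp(-x)                     # unnormalized is fine; we normalize the CDF below

xmin, xmax = 0.0, 5.0
N = 10_000
h = (xmax - xmin) / N
v = cumsum([rho(xmin + i*h) * h for i in 0:N])   # numerical CDF on the grid
v ./= v[end]                                     # normalize so v[end] == 1

function sample_x()
    r = rand()                                   # uniform random number on [0, 1]
    _, j = findmin(abs.(v .- r))                 # grid index where the CDF is closest to r
    return xmin + (j - 1) * h                    # map the index back to x
end

samples = [sample_x() for _ in 1:1000]
```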

Here is why the algorithm works. What we are doing is transforming the PDF of r, which we can call u(r), to the desired PDF for x. The transformation we use is r = P(x). Thus, u(r) dr = \rho(x) dx or \rho(x) = u(r) dr/dx. Now, u(r) is just 1 on the interval [0,1] and dr/dx is the PDF of x as expected. Thus, inverting the CDF for uniform random numbers always gives you a new set of random numbers that satisfies the new PDF. Trivial perhaps, but really useful.

Erratum: 2019-04-05.  It was pointed out to me that my algorithm involving direct summation of the CDF had some errors.  I hope they have now been corrected.

R vs Matlab

It has now been almost half a year since I switched from Matlab to open source software and I’ve been amazed at how easy the transition has been. I had planned to replace Matlab with Python, Julia, and R but I have found that R and Julia have been sufficient for my requirements. Maxima is also more than adequate to replace Mathematica. I have become particularly attached to R especially after I started to use R Studio as the interface. I had only used R before as just a statistics tool but it really is a complete programming platform like Matlab. It has very nice graphics capabilities and I find the language very nice to program in. I really like its use of lists where I can pile sets of any type and any size into one object. I also like how R Studio can save your work into Projects, which keeps the whole environment and history in one place. I can then switch between multiple projects and everything comes back. The only thing I miss from Matlab is the command completion history feature, where I could easily find a previous command by just typing the first few letters. Also, I haven’t quite figured out how to output R data into a text file seamlessly yet. I seem to always get extraneous row or column information. I use Julia when I want to write a code that needs to loop fast but for everything else I’ve been reaching for R.

New paper on genomics

James Lee and I have a new paper out: Lee and Chow, Conditions for the validity of SNP-based heritability estimation, Human Genetics, 2014. As I summarized earlier (e.g. see here and here), heritability is a measure of the proportion of the variance of some trait (like height or cholesterol levels) due to genetic factors. The classical way to estimate heritability is to regress standardized (mean zero, standard deviation one) phenotypes of close relatives against each other. In 2010, Jian Yang, Peter Visscher and colleagues developed a way, sometimes called GREML, to estimate heritability directly from the data obtained in Genome Wide Association Studies (GWAS). Shashaank Vattikuti and I quickly adopted this method and computed the heritability of metabolic syndrome traits as well as the genetic correlations between the traits (link here). Unfortunately, our methods section has a lot of typos but the corrected Methods with the Matlab code can be found here. However, I was puzzled by the derivation of the method provided in the Yang et al. paper. This paper is our resolution. The technical details are below the fold.


Paper on compressed sensing and genomics

New paper on the arXiv. The next step after the completion of the Human Genome Project was the search for genes associated with diseases such as autism or diabetes. However, after spending hundreds of millions of dollars, we find that there are very few common variants of genes with large effects. This doesn't mean that there aren't genes with large effect. The growth hormone gene definitely has a large effect on height. It just means that variations of genes that are common among people have small effects on the phenotype. Given the results of Fisher, Wright, Haldane and colleagues, this was probably expected as the most likely scenario, and recent results measuring narrow-sense heritability directly from genetic markers (e.g. see this) confirm this view.

Current GWAS microarrays assay about one or two million markers and this number is increasing rapidly. Narrow-sense heritability refers to the additive or linear genetic variance, which means the phenotype is given by the linear model y= Z\beta + \eta, where y is the phenotype vector, Z is the genotype matrix, \beta is the vector of genetic effects we want to recover, and \eta contains all the nonadditive components including environmental effects. This is a classic linear regression problem. The problem comes when the number of coefficients in \beta far exceeds the number of people in your sample, which is the case for genomics. Compressed sensing is a field of high dimensional statistics that addresses this specific problem. People such as David Donoho, Emmanuel Candes and Terence Tao have proven under fairly general conditions that if the nonzero coefficients are sparse compared to the number of samples, then the effects can be completely recovered using L1 penalized optimization algorithms such as the lasso or approximate message passing. In this paper, we show that these ideas can be applied to genomics.
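As a toy illustration of the idea (this is not the algorithm used in the paper), here is a minimal Julia sketch that recovers a sparse \beta by iterative soft-thresholding (ISTA), one of the simplest L1 penalized solvers. The dimensions, noise level, and penalty are made-up values.

```julia
# L1-penalized recovery of a sparse beta in the linear model y = Z*beta + eta.
using LinearAlgebra, Random

Random.seed!(1)
n, p, k = 200, 1000, 10                 # samples, markers, nonzero effects (assumed)
Z = randn(n, p)
beta_true = zeros(p)
beta_true[randperm(p)[1:k]] .= randn(k)
y = Z * beta_true + 0.1 * randn(n)

soft(x, t) = sign(x) * max(abs(x) - t, 0.0)   # soft-thresholding operator

function ista(Z, y, lambda; iters = 2000)
    L = opnorm(Z)^2                     # Lipschitz constant of the gradient
    beta = zeros(size(Z, 2))
    for _ in 1:iters
        grad = Z' * (Z * beta - y)
        beta = soft.(beta - grad / L, lambda / L)
    end
    return beta
end

beta_hat = ista(Z, y, 5.0)
println("nonzeros recovered: ", count(abs.(beta_hat) .> 1e-3))
```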

Here is Steve Hsu’s summary of the paper

Application of compressed sensing to genome wide association studies and genomic selection

(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Comments: Main paper (27 pages, 4 figures) and Supplement (5 figures) combined
Subjects: Genomics (q-bio.GN); Applications (stat.AP)
Cite as: arXiv:1310.2264 [q-bio.GN]
(or arXiv:1310.2264v1 [q-bio.GN] for this version)

Most of neuroscience is wrong

John Ioannidis has a recent paper in Nature Reviews Neuroscience arguing that many results in neuroscience are wrong. The argument follows his previous papers on why most published results are wrong (see here and here) but emphasizes the abundance of studies with small sample sizes in neuroscience. This both reduces the chances of finding true positives and increases the chances of obtaining false positives. Underpowered studies are also susceptible to what is called the "winner's curse", where the effect sizes of true positives are artificially amplified. My take is that any phenomenon with a small effect should be treated with caution even if it is real. If you really want to find what causes a given disease, then you probably want to find something that is associated with all cases, not just a small percentage of them.

Bayesian model comparison Part 2

In a previous post, I summarized the Bayesian approach to model comparison, which requires the calculation of the Bayes factor between two models. Here I will show one computational approach that I use, called thermodynamic integration, borrowed from molecular dynamics. Recall that we need to compute the model likelihood function

P(D|M)=\int P(D|M,\theta)P(\theta|M) d\theta     (1)

for each model where P(D|M,\theta) is just the parameter dependent likelihood function we used to find the posterior probabilities for the parameters of the model.

The integration over the parameters can be accomplished using Markov Chain Monte Carlo (MCMC), which I summarized previously here. We will start by defining the partition function

Z(\beta) = \int P(D|M,\theta)^\beta P(\theta| M) d\theta    (2)

where \beta is an inverse temperature. The derivative of the log of the partition function gives

\frac{d}{d\beta}\ln Z(\beta)=\frac{\int d\theta \ln[P(D |\theta,M)] P(D | \theta, M)^\beta P(\theta|M)}{\int d\theta \ P(D | \theta, M)^\beta P(\theta | M)}    (3)

which is equal to the ensemble average of \ln P(D|\theta,M). However, if we assume that the MCMC has reached stationarity then we can replace the ensemble average with a time average over samples, \frac{1}{T}\sum_{i=1}^T \ln P(D|\theta_i, M). Integrating (3) over \beta from 0 to 1 gives

\ln Z(1) = \ln Z(0) + \int_0^1 \langle \ln P(D|M,\theta)\rangle d\beta

From (1) and (2), we see that Z(1)=P(D|M), which is what we want to compute, and Z(0)=\int P(\theta|M) d\theta=1.

Hence, to perform Bayesian model comparison, we simply run the MCMC for each model at different temperatures (i.e. use P(D|M,\theta)^\beta as the likelihood in the standard MCMC) and then integrate the average log likelihoods over \beta at the end to obtain \ln Z(1) = \ln P(D|M). For a Gaussian likelihood function, changing temperature is equivalent to changing the data "error". The higher the temperature the larger the presumed error. In practice, I usually run at seven to ten different values of \beta and use a simple trapezoidal rule to integrate over \beta. I can even do parameter inference and model comparison in the same MCMC run.
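Here is a minimal Julia sketch of the whole procedure for a toy model, a Gaussian likelihood with unknown mean and a standard normal prior; the proposal width, chain length, and temperature ladder are arbitrary illustrative choices.

```julia
# Thermodynamic integration for a toy model: Gaussian data with unknown mean theta.
using Random, Statistics
Random.seed!(2)

data = randn(50) .+ 1.0
loglike(theta) = -0.5 * sum((data .- theta).^2) - 0.5 * length(data) * log(2pi)
logprior(theta) = -0.5 * theta^2 - 0.5 * log(2pi)   # standard normal prior

# Metropolis sampler targeting P(D|theta)^beta * P(theta); returns the average log likelihood.
function mean_loglike(beta; steps = 20_000, width = 0.5)
    theta = 0.0
    ll = Float64[]
    for _ in 1:steps
        prop = theta + width * randn()
        logratio = beta * (loglike(prop) - loglike(theta)) + logprior(prop) - logprior(theta)
        if log(rand()) < logratio
            theta = prop
        end
        push!(ll, loglike(theta))
    end
    return mean(ll[div(steps, 2):end])               # discard the first half as burn-in
end

betas = range(0, 1, length = 8)                       # temperature ladder
avg = [mean_loglike(b) for b in betas]
# Trapezoidal rule over beta gives ln Z(1) = ln P(D|M).
lnZ = sum(0.5 * (avg[i] + avg[i+1]) * (betas[i+1] - betas[i]) for i in 1:length(betas)-1)
println("log model evidence ≈ ", lnZ)
```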

Erratum, 2013-5-2013,  I just fixed an error in the final formula

Prediction requires data

Nate Silver has been hailed in the media as a vindicated genius for correctly predicting the election. He was savagely attacked before the election for predicting that Obama would win handily. Kudos also go to Sam Wang, Pollster.com, electoral-vote.com, and all others who simply took the data obtained from the polls seriously. Hence, the real credit should go to all of the polling organizations for collectively not being statistically biased. It didn't matter if single organizations were biased one way or the other as long as they were not correlated in their biases. The true power of prediction in this election was that the errors of the various pollsters were independently distributed. However, even if you didn't take the data at face value, you could still reasonably predict the election. Obama had an inherent advantage because he had more paths to winning 270 electoral votes. Suppose there were 8 battleground states and Romney needed to win at least 6 of them. Hence, Romney had 28 ways to win while Obama had 228 ways to win. If the win probability was approximately a half in each of these states, which is what a lot of people claimed, then Romney had slightly more than a one in ten chance of winning, which is close to the odds given by Sam. The only way Romney's odds would increase is if the state results were correlated in his favour. However, it would take a lot of correlated bias for Romney to be the favourite.
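The counting in this example can be checked in a couple of lines of Julia (using the corrected numbers given in the erratum below):

```julia
# Ways to win at least 6 of 8 toss-up battleground states.
romney_ways = sum(binomial(8, k) for k in 6:8)   # 28 + 8 + 1 = 37
obama_ways  = 2^8 - romney_ways                  # 219
p_romney    = romney_ways / 2^8                  # ≈ 0.14 if each state is a fair coin flip
```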


Erratum, Nov 9 2012:  Romney actually has 37 ways and Obama 219 in my example.  The total must add up to 2^8=256.  I forgot to include the fact that Romney could also win 7 of 8 states or all states in his paths to winning.

Predicting the election

The US presidential election on Nov. 6 is expected to be particularly close. The polling has been vigorous and there are many statistical prediction web sites. One of them, the Princeton Election Consortium, is run by neuroscientist Sam Wang at Princeton University. For any non-American readers, the president is not elected directly by the citizens but through what is called the electoral college. This is a set of 538 electoral voters that are selected by the individual states. The electoral votes are allotted according to the number of congressional districts per state plus two. Hence, low population states are over-represented. Almost all states agree that the candidate who takes the plurality of the votes in that state wins all the electoral votes of that state. Maine and Nebraska are the two exceptions; they allot electoral votes according to who wins each congressional district. Thus, in order to predict who will win, one must predict who will get at least 270 electoral votes. Most of the states are not competitive, so the focus of the candidates (and media) is on a handful of so-called battleground states like Ohio and Colorado. Currently, Sam Wang predicts that President Obama will win the election with a median of 319 votes. Sam estimates the Bayesian probability for Obama's re-election to be 99.6%. Nate Silver at another popular website (FiveThirtyEight) predicts that Obama will win 305 electoral votes and has a re-election probability of 83.7%.

These estimates are made by combining polling data with a statistical model. Nate Silver uses national and state polls along with some economic indicators, although the precise model is unknown. Sam Wang uses only state polls. I'll describe his method here. The goal is to estimate the probability distribution for the number of electoral votes a specific candidate will receive. The state space consists of 2^{51} possibilities (50 states plus the District of Columbia). I will assume that Maine and Nebraska do not split their votes along congressional districts, although it is a simple task to include that possibility. Sam assumes that the individual states are statistically independent so that the joint probability distribution factorizes completely. He then takes the median of the polls for each state over some time window to represent that state. The polling data consist of the voting preferences of a sample of voters for a given candidate, and the preferences are converted into win probabilities using a normal distribution. He then computes the probability for all 2^{51} combinations. Suppose that there are just two states with win probabilities for your candidate of p_1 and p_2. The probability of your candidate winning both states is p_1 p_2, of winning state 1 but not state 2 is p_1(1-p_2), and so forth. If the states have EV_1 and EV_2 electoral votes respectively, then winning both states yields EV_1+EV_2 votes, and so forth. To keep the bookkeeping simple, Sam uses the trick of expressing the probability distribution as a polynomial in a dummy variable x. So the probability distribution is

(p_1 x^{EV_1} + 1-p_1)(p_2 x^{EV_2} + 1-p_2)

= p_1 p_2 x^{EV_1+EV_2} + p_1(1-p_2) x^{EV_1} + (1-p_1)p_2 x^{EV_2} + (1-p_1)(1-p_2)

Hence, the coefficient of each term is the probability for the number of electoral votes given by the exponent of x. The expression for 51 "states" is \prod_{i=1}^{51} (p_i x^{EV_i} + 1-p_i) and this can be evaluated quickly on a desktop computer. One can then take the median or mean of the distribution for the predicted number of electoral votes. The sum of the probabilities for electoral votes greater than 269 gives the winning probability, although Sam uses a more sophisticated method for his predicted probabilities. The model does assume that the state probabilities are independent. Sam tries to account for this by using what he calls a meta-margin, in which he calculates how much the probabilities (in terms of preference) need to move for the leading candidate to lose. Also, the state polls will likely pick up any correlations as the election gets closer.
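Multiplying out the polynomial is just a sequence of convolutions, which is easy to code. Here is a minimal Julia sketch with made-up win probabilities and electoral vote counts for four hypothetical states:

```julia
# Generating-polynomial bookkeeping: multiplying the factors (p_i x^{EV_i} + 1 - p_i)
# is the same as repeatedly convolving probability vectors.
function ev_distribution(p, ev)
    total = sum(ev)
    dist = zeros(total + 1)          # dist[k+1] = probability of exactly k electoral votes
    dist[1] = 1.0                    # empty product: 0 votes with probability 1
    for (pw, votes) in zip(p, ev)
        newdist = zeros(total + 1)
        for k in 0:total
            dist[k+1] == 0 && continue
            newdist[k+1]       += dist[k+1] * (1 - pw)  # lose the state: exponent unchanged
            newdist[k+1+votes] += dist[k+1] * pw        # win the state: exponent shifts by EV_i
        end
        dist = newdist
    end
    return dist
end

p  = [0.8, 0.6, 0.55, 0.3]           # assumed win probabilities in four toy states
ev = [18, 10, 29, 15]                # assumed electoral vote counts
dist = ev_distribution(p, ev)
expected_ev = sum(k * dist[k+1] for k in 0:sum(ev))
# With all 51 races, the winning probability is sum(dist[k+1] for k in 270:538).
```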

Most statistical models predict that Obama will be re-elected with fairly high probability but the national polls are showing that the race is almost tied. This discrepancy is a puzzle.  Silver’s hypothesis for why is here and Sam’s is here.  One of the sources for error in polls is that they must predict who will vote.  The 2008 election had a voter turnout of a little less than 62%. That means that an election can be easily won or lost based on turnout alone, which makes one wonder about democracy.


Nov 4: dead link is updated

Open Science Framework

In an effort to make published science less wrong, psychologist Brian Nosek and collaborators have started what is called the Open Science Framework. The idea is that all results from experiments can be openly documented for everyone to see. This way, negative results that are locked away in the proverbial "file drawer" will be available. In light of the fact that many high impact results turn out to be wrong (e.g. see here and here), we definitely needed to do something and I think this is a good start. You can hear Nosek talk about this on EconTalk here.

What’s your likelihood?

Different internal Bayesian likelihood functions may be why we disagree. The recent shooting tragedies in Colorado and Wisconsin have set off a new round of arguments about gun control. Following the debate has made me realize that the reason the two sides can't agree (and why different sides of almost every controversial issue can't agree) is that their likelihood functions are completely different. The optimal way to make inferences about the world is to use Bayesian inference and there is some evidence that we are close to optimal in some circumstances. Nontechnically, Bayesian inference is a way to update the strength of your belief in something (i.e. its probability) given new data. What you do is combine your prior probability with the likelihood of the data given your internal model of the issue (and then normalize to get a posterior probability). For a more technical treatment of Bayesian inference, see here. I posted previously (see here) that I thought drastic differences in prior probabilities were why people don't seem to update their beliefs when faced with overwhelming evidence to the contrary. However, I'm starting to realize that the main reason might be that they have completely different models of the world, which in technical terms means different likelihood functions.
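Here is the update as a toy Julia calculation with invented numbers: two models of the world, the same prior, and a likelihood for the observed data under each.

```julia
# Toy Bayes update: posterior is proportional to prior times likelihood, then normalized.
prior = [0.5, 0.5]            # prior belief in model A and model B
likelihood = [0.9, 0.2]       # probability each model assigns to the observed data (invented)
posterior = prior .* likelihood
posterior ./= sum(posterior)  # ≈ [0.82, 0.18]
```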

Consider the issue of gun control. The anti-gun control side argues that "guns don't kill people, people kill people" and that restricting access to guns won't prevent determined malcontents from coming up with some means to kill. The pro-gun control side argues that the amount of gun violence is inversely proportional to the ease of access to guns. After all, you would be hard pressed to kill twenty people in a movie theatre with a sword. The difference between these two viewpoints can be summarized by their models of the world. The anti-gun control people believe that the distribution of the will of people who would commit violence looks like this

where the horizontal line represents a level of gun restriction.  In this world view, no amount of gun restriction would prevent these people from undertaking their nefarious designs.  On the other hand, the pro-gun control side believes that the same distribution looks like this

in which case, the higher you set the barrier, the fewer crimes are committed. Given these two views of the world, it is clear why new episodes of gun violence like the recent ones in Colorado and Wisconsin do not change people's minds. What you would need to do is to teach a new likelihood function to the other side, and that may take decades, if it can be done at all.


We can all be above average

I was listening to physicist and science writer Leonard Mlodinow on an All in the Mind podcast this morning. He was talking about his new book, Subliminal, which is about recent neuroscience results on neural processes that operate in the absence of conscious awareness. During the podcast, which was quite good, Mlodinow quoted a result that said 95% of all professors think they are above average, and then he went on to say that we all know that only 50% can be. It's unfortunate that Mlodinow, who wrote an excellent book on probability theory, would make such a statement. I am sure that he knows that 50% of all professors are better than the median, but almost any fraction of them could be better than the average (i.e. the mean). He used average in the colloquial sense, but knowing the difference between median and mean is crucial for the average, or should I say median, person to make informed decisions.

It could be that on a professor performance scale, 5% of all professors are phenomenally bad, while the rest are better but clumped together. In this case, it would be absolutely true that 95% of all professors are better than the mean. Also, if the professors actually obeyed such a distribution then comparing to the mean would be more informative than the median, because what every student should do is simply avoid the really bad professors. However, for something like income, which has a broad distribution with a fat tail, comparing yourself to the median is probably more informative because it will tell you where you place in society. The mean salary of a lecture hall filled with mathematicians would increase perhaps a hundredfold if James Simons (of Chern-Simons theory as well as the founder of one of the most successful hedge funds, Renaissance Technologies) were to suddenly walk into the room. However, the median would hardly budge. (Almost) all the children in Lake Wobegon really could be above average.
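Here is the professor example in a few lines of Julia, with made-up scores:

```julia
# Toy skewed distribution: 5% phenomenally bad, the rest clumped together.
using Statistics
scores = vcat(fill(1.0, 5), fill(10.0, 95))
mean(scores)                                     # 9.55
median(scores)                                   # 10.0
count(scores .> mean(scores)) / length(scores)   # 0.95, i.e. 95% are above the mean
```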

Jun 18, 2013: I adjusted the last sentence to avoid confusion.

New paper in Biophysical Journal

Bayesian Functional Integral Method for Inferring Continuous Data from Discrete Measurements

Biophysical Journal, Volume 102, Issue 3, 399-406, 8 February 2012

doi:10.1016/j.bpj.2011.12.046

William J. Heuett, Bernard V. Miller, Susan B. Racette, John O. Holloszy, Carson C. Chow, and Vipul Periwal

Abstract: Inference of the insulin secretion rate (ISR) from C-peptide measurements as a quantification of pancreatic β-cell function is clinically important in diseases related to reduced insulin sensitivity and insulin action. ISR derived from C-peptide concentration is an example of nonparametric Bayesian model selection where a proposed ISR time-course is considered to be a “model”. An inferred value of inaccessible continuous variables from discrete observable data is often problematic in biology and medicine, because it is a priori unclear how robust the inference is to the deletion of data points, and a closely related question, how much smoothness or continuity the data actually support. Predictions weighted by the posterior distribution can be cast as functional integrals as used in statistical field theory. Functional integrals are generally difficult to evaluate, especially for nonanalytic constraints such as positivity of the estimated parameters. We propose a computationally tractable method that uses the exact solution of an associated likelihood function as a prior probability distribution for a Markov-chain Monte Carlo evaluation of the posterior for the full model. As a concrete application of our method, we calculate the ISR from actual clinical C-peptide measurements in human subjects with varying degrees of insulin sensitivity. Our method demonstrates the feasibility of functional integral Bayesian model selection as a practical method for such data-driven inference, allowing the data to determine the smoothing timescale and the width of the prior probability distribution on the space of models. In particular, our model comparison method determines the discrete time-step for interpolation of the unobservable continuous variable that is supported by the data. Attempts to go to finer discrete time-steps lead to less likely models.