Technology and inference

In my previous post, I gave an example of how fake news could lead to a scenario of no update of posterior probabilities. However, this situation could arise from the mere existence of technology. When I was a child, fantasy and science fiction movies always had a campy feel because the special effects looked unrealistic. When Godzilla came out of Tokyo Harbour, it looked like little models in a bathtub. The Creature from the Black Lagoon looked like a man in a rubber suit. I think the first science fiction movie that looked astonishingly real was Stanley Kubrick’s 1968 masterpiece 2001: A Space Odyssey, which adhered to physics like no film before and only a handful since. The simulation of weightlessness in space was marvelous, and to me the ultimate attention to detail was the scene in the rotating space station where a mild curvature in the floor could be perceived. The next groundbreaking moment was the 1993 film Jurassic Park, which truly brought dinosaurs to life. The first scene of a giant sauropod eating from a treetop was astonishing. The distinction between fantasy and reality was forever gone.

The effect of this essentially perfect rendering of anything into a realistic image is that we now have a plausible reason to reject any evidence. Photographic evidence can be completely discounted because the technology exists to create completely fabricated versions. This is equally true of audio recordings and anything you read on the Internet. In Bayesian terms, we now have an internal model, or likelihood function, that assigns some constant probability to any data being fake. The more cynical you are, the closer this constant is to one. Once the likelihood becomes insensitive to data, we are in the same situation as before. Technology alone, in the absence of fake news, could lead to a world where no one ever changes their mind. The irony is that this could force people to evaluate truth the way they did before such technology existed: you believe people (or machines) that you have come to trust by building relationships over long periods of time.

Fake news and beliefs

Much has been written of the role of fake news in the US presidential election. While we will never know how much it actually contributed to the outcome, as I will show below, it could certainly affect people’s beliefs. Psychology experiments have found that humans often follow Bayesian inference – the probability we assign to an event or action is updated according to Bayes rule. For example, suppose P(T) is the probability we assign to whether climate change is real; P(F) = 1-P(T) is our probability that climate change is false. In the Bayesian interpretation of probability, this would represent our level of belief in climate change. Given new data D (e.g. news), we will update our beliefs according to

P(T|D) = \frac{P(D|T) P(T)}{P(D)}

What this means is that our posterior probability or belief that climate change is true given the new data, P(T|D), is equal to the probability that the new data came from our internal model of a world with climate change (i.e. our likelihood), P(D|T), multiplied by our prior probability that climate change is real, P(T), divided by the probability of obtaining such data in all possible worlds, P(D). According to the rules of probability, the latter is given by P(D) = P(D|T)P(T) + P(D|F)P(F), which is the sum of the probability the data came from a world with climate change and that from one without.
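For concreteness, here is a minimal sketch of this binary update in Python. The numbers are purely illustrative, not taken from any data:

```python
def posterior(prior_T, lik_T, lik_F):
    """P(T|D) by Bayes rule, with P(D) = P(D|T)P(T) + P(D|F)P(F)."""
    prior_F = 1 - prior_T
    evidence = lik_T * prior_T + lik_F * prior_F  # P(D)
    return lik_T * prior_T / evidence

# Data twice as likely under the climate-change model: belief rises.
print(posterior(0.5, 0.8, 0.4))    # 0.666...
# A prior of zero never updates, no matter how strong the data.
print(posterior(0.0, 0.99, 0.01))  # 0.0
```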

This update rule reveals what will happen in the presence of new data, including fake news. The first thing to notice is that if P(T) is zero, then there is no update. In this binary case, if we believe that climate change is absolutely false or absolutely true, then no data will change our mind. In the case of multiple outcomes, any outcome with zero prior (i.e. no support) will never change. So if we have very sharp priors, fake news has no impact because no news has any impact. If we have nonzero priors for both true and false, then if the data is more likely under our true model, our posterior for true will increase, and vice versa. Our posteriors will tend in the direction of the data, and thus fake news could have a real impact.

For example, suppose we have an internal model in which we expect the mean annual temperature to be 10 degrees Celsius with a standard deviation of 3 degrees if there is no climate change, and a mean of 13 degrees with climate change. Thus if the reported data are mostly centered around 13 degrees then our belief in climate change will increase, and if they are mostly centered around 10 degrees then it will decrease. However, if we get data that is spread uniformly over a wide range then both models could be equally likely and we would get no update. Mathematically, if P(D|T)=P(D|F) then P(D) = P(D|T)(P(T)+P(F)) = P(D|T). From the Bayesian update rule, the posterior will be identical to the prior. In a world with lots of misleading data, there is no update. Thus, obfuscation and sowing confusion are very good strategies for preventing updates of priors. You don’t need to refute data, just provide fake examples and bury the data in a sea of noise.
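This temperature example can be simulated directly. A sketch assuming Gaussian likelihoods with the stated means and standard deviation; the individual temperature readings below are invented for illustration:

```python
import math

def gauss(x, mu, sd=3.0):
    """Gaussian likelihood with a standard deviation of 3 degrees."""
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def update(prior_T, temp, mu_T=13.0, mu_F=10.0):
    """One Bayesian update of P(T) given a reported temperature."""
    lik_T, lik_F = gauss(temp, mu_T), gauss(temp, mu_F)
    return lik_T * prior_T / (lik_T * prior_T + lik_F * (1 - prior_T))

p = 0.5
for temp in [13.1, 12.8, 13.4]:  # readings near the climate-change mean
    p = update(p, temp)
print(p)  # belief in climate change grows above 0.5

# A reading of 11.5 is equally likely under both models, so no update:
print(update(0.5, 11.5))  # stays at 0.5
```

Note that ambiguous data leaves the prior untouched, exactly the obfuscation effect described above.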


Revolution vs incremental change

I think that the dysfunction and animosity we currently see in the US political system and election is partly due to the underlying belief that meaningful change cannot be effected through slow evolution but rather requires an abrupt revolution where the current system is torn down and rebuilt. There is some merit to this idea. Sometimes the structure of a building can be so damaged that it would be easier to demolish and rebuild rather than repair and renovate. Mathematically, this can be expressed as a system being stuck in a local minimum (where getting to the global minimum is desired). In order to get to the true global optimum, you need to get worse before you can get better. When fitting nonlinear models to data, dealing with local minima is a major problem and the reason that a stochastic MCMC algorithm that does occasionally go uphill works so much better than gradient descent, which only goes downhill.

However, the recent success of deep learning may dispel this notion when the dimension is high enough. Deep learning, which uses multi-layer neural networks that can have millions of parameters, is the quintessence of a high dimensional model. Yet it seems to work just fine using the backpropagation algorithm, which is a form of gradient descent. The reason could be that in high enough dimensions, local minima are rare and the majority of critical points (places where the slope is zero) are saddle points, from which there is always a way out in some direction. In order to have a local minimum, the matrix of second derivatives in all directions (i.e. the Hessian matrix) must be positive definite (i.e. have all positive eigenvalues). As the dimension of the matrix gets larger, there are simply more ways for at least one eigenvalue to be negative, and that is all you need to provide an escape hatch. So in a high dimensional system, gradient descent may work just fine, and there could be an interesting tradeoff between a parsimonious model with few parameters that is difficult to fit versus a high dimensional model that is easy to fit. The usual danger of having too many parameters is that you overfit: you fit the noise at the expense of the signal and lose the ability to generalize. However, deep learning models seem able to overcome this limitation.
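This intuition about saddle points can be checked with a toy numerical experiment. Here random symmetric Gaussian matrices stand in for Hessians at critical points (an idealized assumption, not the actual deep-learning result); the fraction that are positive definite, i.e. true local minima, collapses rapidly with dimension:

```python
import random

def is_positive_definite(a):
    """Test for all-positive eigenvalues via an attempted Cholesky factorization."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:  # a pivot failed: some eigenvalue is <= 0
                    return False
                l[i][i] = d ** 0.5
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return True

def fraction_minima(dim, trials=1000, seed=0):
    """Fraction of random symmetric Gaussian matrices that are positive definite."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        m = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
        sym = [[(m[i][j] + m[j][i]) / 2 for j in range(dim)] for i in range(dim)]
        hits += is_positive_definite(sym)
    return hits / trials

for d in [1, 2, 5, 10]:
    print(d, fraction_minima(d))  # roughly 0.5 at d=1, essentially 0 by d=10
```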

Hence, if the dimension is high enough, evolution can work, while if it is too low you need a revolution. So the question is: what is the dimensionality of governance and politics? In my opinion, the historical record suggests that revolutions generally do not lead to good outcomes, and even when they do, small incremental changes seem to get you to a similar place. For example, the US and France had bloody revolutions while Canada and England did not, and they have all arrived at similar liberal democratic systems. In fact, one could argue that a constitutional monarchy (like Canada and Denmark), where the head of state is a figurehead, is more stable and benign than a republic, like Venezuela or Russia (e.g. see here). This distinction could have pertinence for the current US election if a group of well-meaning people, who believe that the two major parties have no meaningful differences, do not vote or vote for a third party. They should keep in mind that incremental change is possible and that small policy differences can and do make a difference in people’s lives.

AlphaGo and the Future of Work

In March of this year, Google DeepMind’s computer program AlphaGo defeated world Go champion Lee Sedol. This was hailed as a great triumph of artificial intelligence and signaled to many the beginning of a new age when machines take over. I believe this is true, but the real lesson of AlphaGo’s win is not how great machine learning algorithms are but how suboptimal human Go players are. Experts believed that machines would not be able to defeat humans at Go for a long time because the number of possible games is astronomically large, \sim 250^{150}, in contrast to chess with a paltry \sim 35^{80}. Additionally, unlike chess, it is not clear what constitutes a good position or who is winning during the intermediate stages of a game. Thus, any direct enumeration and evaluation of possible next moves, as chess computers like IBM’s Deep Blue (which defeated Garry Kasparov) do, seemed impossible. It was thought that humans had some sort of inimitable intuition for Go that machines were decades away from emulating. It turns out that this was wrong. It took remarkably little training for AlphaGo to defeat a human. All the algorithms used were fairly standard: supervised and reinforcement learning with backpropagation in multi-layer neural networks [1]. DeepMind just put them together in a clever way and had the (in retrospect appropriate) audacity to try.

The take home message of AlphaGo’s success is that humans are very, very far away from being optimal at playing Go. Uncharitably, we simply stink at Go. However, this probably also means that we stink at almost everything we do. Machines are going to take over our jobs not because they are sublimely awesome but because we are stupendously inept. It is like the old joke about two hikers encountering a bear and one starts to put on running shoes. The other hiker says: “Why are you doing that? You can’t outrun a bear,” to which she replies, “I only need to outrun you!” In fact, the more difficult a job seems to be for humans to perform, the easier it will be for a machine to do better. This was noticed a long time ago in AI research and called Moravec’s Paradox. Tasks that require a lot of high level abstract thinking, like playing chess or predicting what movie you will like, are easy for computers, while seemingly trivial tasks that a child can do, like folding laundry or getting a cookie out of a jar on an unreachable shelf, are really hard. Thus high paying professions in medicine, accounting, finance, and law could be replaced by machines sooner than lower paying ones in lawn care and house cleaning.

There are those who are not worried about a future of mass unemployment because they believe people will just shift to other professions. They point out that a century ago a majority of Americans worked in agriculture and now the sector employs less than 2 percent of the population. The jobs that were lost to technology were replaced by ones that didn’t exist before. I think this might be true, but in the future not everyone will be a software engineer or a media star or the CEO of her own company of robot employees. The increase in productivity provided by machines ensures this. When the marginal cost of production (i.e. the cost to make one more item) goes to zero, as it has for software and recorded media now, the whole supply-demand curve is upended. There is infinite supply for any amount of demand, so the only way to make money is to increase demand.

The rate-limiting step for demand is the attention span of humans. In a single day, a person can at most attend to a few hundred independent tasks such as thinking, reading, writing, walking, cooking, eating, driving, exercising, or consuming entertainment. I can stream any movie I want now, yet I watch at most twenty a year, almost all of them on long haul flights. My 3 year old can watch the same Wild Kratts episode (a great children’s show about animals) ten times in a row without getting bored. Even though everyone could be a video or music star on YouTube, superstars such as Beyoncé and Adele are viewed much more than anyone else. Even with infinite choice, we tend to do what our peers do. Thus, for a population of ten billion people, I doubt there can be more than a few million who can make a decent living as media stars under our current economic model. The same goes for writers. This will also generalize to manufactured goods. Toasters and coffee makers cost essentially nothing compared to three decades ago, and I will only buy one every few years, if that. Robots will only make things cheaper, and I doubt there will be a billion brands of TVs or toasters. Most likely, a few companies will dominate the market as they do now. Even if we could optimistically assume that a tenth of the population could be engaged in producing goods and services necessary for keeping the world functioning, that still leaves the rest with little to do.

Even much of what scientists do could eventually be replaced by machines. Biology labs could consist of a principal investigator and robot technicians. Although it seems like science is endless, the amount of new science required for sustaining the modern world could diminish. We could eventually have an understanding of biology sufficient to treat most diseases and injuries and to develop truly sustainable energy technologies. In this case, machines could be tasked to keep the modern world up and running with little need of input from us. Science would mostly be devoted to abstract and esoteric concerns.

Thus, I believe the future for humankind is in low productivity occupations – basically a return to pre-industrial endeavors like small plot farming, blacksmithing, carpentry, painting, dancing, and pottery making, with an economic system in place to adequately live off of this labor. Machines can provide us with the necessities of life while we engage in a simulated 18th century world but without the poverty, diseases, and mass famines that made life so harsh back then. We can make candles or bread and sell them to our neighbors for a living wage. We can walk or get in self-driving cars to see live performances of music, drama and dance by local artists. There will be philosophers and poets with their small followings as they have now. However, even when machines can do everything humans can do, there will still be a capacity to sustain as many mathematicians as there are people because mathematics is infinite. As long as P is not NP, theorem proving can never be automated and there will always be unsolved math problems.  That is not to say that machines won’t be able to do mathematics. They will. It’s just that they won’t ever be able to do all of it. Thus, the future of work could also be mathematics.

  1. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

The simulation argument made quantitative

Elon Musk, of SpaceX, Tesla, and SolarCity fame, recently mentioned that he thought the odds of us not living in a simulation were a billion to one. His reasoning was based on extrapolating the rate of improvement in video games. He suggests that soon it will be impossible to distinguish simulations from reality and that in ten thousand years there could easily be billions of simulations running. Thus there would be a billion times more simulated universes than real ones.

This simulation argument was first quantitatively formulated by philosopher Nick Bostrom. He even has an entire website devoted to the topic (see here). In his original paper, he proposed a Drake-like equation for the fraction of all “humans” living in a simulation:

f_{sim} = \frac{f_p f_I N_I}{f_p f_I N_I + 1}

where f_p is the fraction of human level civilizations that attain the capability to simulate a human populated civilization, f_I is the fraction of these civilizations interested in running civilization simulations, and N_I is the average number of simulations running in these interested civilizations. He then argues that if N_I is large, then either f_{sim}\approx 1 or f_p f_I \approx 0. Musk believes that it is highly likely that N_I is large and f_p f_I is not small so, ergo, we must be in a simulation. Bostrom says his gut feeling is that f_{sim} is around 20%. Steve Hsu mocks the idea (I think). Here, I will show that we have absolutely no way to estimate our probability of being in a simulation.

The reason is that Bostrom’s equation obscures the possibility of two possible divergent quantities. This is more clearly seen by rewriting his equation as

f_{sim} = \frac{y}{x+y} = \frac{y/x}{y/x+1}

where x is the number of non-sim civilizations and y is the number of sim civilizations. (Re-labeling x and y as people or universes does not change the argument). Bostrom and Musk’s observation is that once a civilization attains simulation capability then the number of sims can grow exponentially (people in sims can run sims and so forth) and thus y can overwhelm x and ergo, you’re in a simulation. However, this is only true in a world where x is not growing or growing slowly. If x is also growing exponentially then we can’t say anything at all about the ratio of y to x.

I can give a simple example. Consider the following dynamics

\frac{dx}{dt} = ax

\frac{dy}{dt} = bx + cy

y is being created by x, but both are growing exponentially. The interesting property of exponentials is that a solution to these equations for a > c is

x = \exp(at)

y = \frac{b}{a-c}\exp(at)

where I have chosen convenient initial conditions that don’t affect the results. Even though y is growing exponentially on top of an exponential process, the growth rates of x and y are the same. The probability of being in a simulation is then

f_{sim} = \frac{b}{a+b-c}

and we have no way of knowing what this is. The analogy is a goose laying eggs, where each daughter also lays eggs, as do her daughters. It would seem that there would be more eggs from the collective progeny than from the original mother. However, if the rate of egg laying by the original mother goose is increasing exponentially, then the number of mother eggs can grow as fast as the number of daughter, granddaughter, great…, eggs. This is just another example of how thinking quantitatively can give interesting (and sometimes counterintuitive) results. Until we have a better idea about the physics underlying our universe, we can say nothing about our odds of being in a simulation.
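One can check this claim with a direct numerical integration of the two equations. The rates below are arbitrary choices satisfying a > c; the simulated fraction converges to b/(a+b-c):

```python
# Forward-Euler integration of dx/dt = a*x, dy/dt = b*x + c*y,
# with x(0) = 1, y(0) = 0 and arbitrary rates satisfying a > c.
a, b, c = 1.0, 0.5, 0.3
dt, steps = 1e-3, 20_000  # integrate out to t = 20

x, y = 1.0, 0.0
for _ in range(steps):
    x, y = x + a * x * dt, y + (b * x + c * y) * dt

print(y / (x + y))      # simulated fraction of sims
print(b / (a + b - c))  # predicted value: 0.4166...
```

Even though the number of sims y grows without bound, the fraction settles at a constant strictly between 0 and 1 that depends on the unknown rates.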

Addendum: One of the predictions of this simple model is that there should be lots of pre-sim universes. I have always found it interesting that the age of the universe is only about three times that of the earth. Given that the expansion rate of the universe is actually increasing, the lifetime of the universe is likely to be much longer than the current age. So, why is it that we are alive at such an early stage of our universe? Well, one reason may be that the rate of universe creation is very high and so the probability of being in a young universe is higher than being in an old one.

Addendum 2: I only gave a specific solution to the differential equation. The full solution has the form Y_1\exp(at) + Y_2 \exp(ct). However, as long as a > c, the first term will dominate.

Addendum 3: I realized that I didn’t make it clear that the civilizations don’t need to be in the same universe. Multiverses with different parameters are predicted by string theory.  Thus, even if there is less than one civilization per universe, universes could be created at an exponentially increasing rate.


Chomsky on The Philosopher’s Zone

Listen to MIT Linguistics Professor Noam Chomsky on ABC’s radio show The Philosopher’s Zone (link here). Even at 87, he is still as razor sharp as ever. I’ve always been an admirer of Chomsky, although I think I now mostly disagree with his ideas about language. I do remember being completely mesmerized by the few talks of his I attended when I was a graduate student.

Chomsky is the father of modern linguistics. He turned it into a subfield of computer science and mathematics. People still use Chomsky Normal Form and the Chomsky Hierarchy in computer science. Chomsky believes that the language ability is universal among all humans and is genetically encoded. He comes to this conclusion because in his mathematical analysis of language he found what he called “deep structures”, which are embedded rules that we are consciously unaware of when we use language. He was adamantly opposed to the idea that language could be acquired via a probabilistic machine learning algorithm. His most famous example is that we know that the sentence “Colorless green ideas sleep furiously” is grammatical but nonsensical, while the sentence “Furiously sleep ideas green colorless” is ungrammatical. Since neither of these sentences had ever been spoken or written, he surmised that no statistical algorithm could ever learn the difference between the two. I think it is pretty clear now that Chomsky was incorrect and that machine learning can learn to parse language and classify these sentences. There has also been fieldwork that seems to indicate that there exist languages in the Amazon that are qualitatively different from the universal set. It seems that the brain, rather than having an innate ability for grammar and language, may have an innate ability to detect and learn deep structure from a very small amount of data.

The host Joe Gelonesi, who has filled in admirably for the sadly departed Alan Saunders, asks Chomsky about the hard problem of consciousness near the end of the program. Chomsky, in his typical fashion of invoking 17th and 18th century philosophy, dismisses it by claiming that science itself and physics in particular has long dispensed with the equivalent notion. He says that the moment that Newton wrote down the equation for gravitational force, which requires action at a distance, physics stopped being about making the universe intelligible and became about creating predictive theories. He thus believes that we will eventually be able to create a theory of consciousness although it may not be intelligible to humans. He also seems to subscribe to panpsychism, where consciousness is a property of matter like mass, an idea championed by Christof Koch and Giulio Tononi. However, as I pointed out before, panpsychism is dualism. If it does exist, then it exists apart from the way we currently describe the universe. Lately, I’ve come to believe and accept the fact that consciousness is an epiphenomenon and has no causal consequence in the universe. I must credit David Chalmers (e.g. see previous post) for making it clear that this is the only recourse to dualism. We are no more nor less than automata caroming through the universe, with the ability to spectate a few tens of milliseconds after the fact.

Addendum: As pointed out in the comments, there are monistic theories, such as that espoused by Bishop Berkeley, in which only ideas are real. My point, that the only recourse to dualism is an epiphenomenal consciousness, holds only if one adheres to materialism.

Probability of gun death

The tragedy in Oregon has reignited the gun debate. Gun control advocates argue that fewer guns mean fewer deaths while gun supporters argue that if citizens were armed then shooters could be stopped through vigilante action. These arguments can be quantified in a simple model of the probability of gun death, p_d:

p_d = p_gp_u(1-p_gp_v) + p_gp_a

where p_g is the probability of having a gun, p_u is the probability of being a criminal or mentally unstable enough to become a shooter, p_v is the probability of effective vigilante action, and p_a is the probability of accidental death or suicide. The probability of being killed by a gun is given by the probability of someone having a gun times the probability that they are unstable enough to use it. This is reduced by the probability of a potential victim having a gun times the probability of acting effectively to stop the shooter. Finally, there is also a probability of dying through an accident.

The first derivative of p_d with respect to p_g is p_u - 2 p_u p_v p_g + p_a and the second derivative is negative. Thus, any critical point in the interior 0 < p_g < 1 is a maximum, and the minimum of p_d must be at the boundary. Given that p_d = 0 when p_g=0 and p_d = p_u(1-p_v) + p_a when p_g = 1, the absolute minimum is attained when no one has a gun. Even if vigilante action were 100% effective, there would still be gun deaths due to accidents. Now, some would argue that zero guns is not possible, so we can ask whether it is better to have fewer guns or more guns. p_d is maximal at p_g = (p_u + p_a)/(2p_u p_v). Thus, unless p_v is greater than one half, even in the absence of accidents there is no situation where increasing the number of guns makes us safer. The bottom line is that if we want to reduce gun deaths we should either reduce the number of guns or make sure everyone is armed and has military training.
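The boundary argument can be verified with a quick numerical scan of p_d over p_g. The parameter values below are invented purely for illustration; p_v = 0.8 is chosen so that the interior maximum lands inside the unit interval:

```python
def p_death(pg, pu=0.01, pv=0.8, pa=0.001):
    """p_d = p_g p_u (1 - p_g p_v) + p_g p_a, with made-up parameter values."""
    return pg * pu * (1 - pg * pv) + pg * pa

grid = [i / 1000 for i in range(1001)]  # p_g from 0 to 1
vals = [p_death(pg) for pg in grid]

print(grid[vals.index(min(vals))])         # minimum sits at the boundary p_g = 0
print(grid[vals.index(max(vals))])         # maximum near the interior critical point
print((0.01 + 0.001) / (2 * 0.01 * 0.8))   # predicted (p_u + p_a)/(2 p_u p_v) = 0.6875
```

The scan confirms the calculus: the minimum of p_d is at p_g = 0, and the worst case is at an intermediate level of gun ownership rather than at either extreme.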