The game theory world was stunned recently when Bill Press and Freeman Dyson found a new strategy for the iterated prisoner’s dilemma (IPD) game. They show how you can extort an opponent such that the only way the opponent can maximize their own payoff is to give you an even higher payoff. The paper, published in PNAS (link here) with a commentary (link here), is so clever and brilliant that I thought it would be worthwhile to write a pedagogical summary for those who are unfamiliar with some of the methods and concepts they use. This paper shows how knowing a little bit of linear algebra can go a long way toward exploring deep ideas.
In the classic prisoner’s dilemma, two prisoners are interrogated separately. Each has two choices. If they both stay silent (cooperate) they each get a year in prison. If one confesses (defects) while the other stays silent, then the defector is released while the cooperator gets 5 years. If both defect then they both get 3 years in prison. Hence, even though the best outcome for the pair is mutual cooperation, the only logical thing to do is to defect. You can watch this played out on the British television show Golden Balls (see example here). Usually the payout is expressed as a reward: if they both cooperate they both get 3 points, if one defects and the other cooperates then the defector gets 5 points and the cooperator gets zero, and if they both defect they each get 1 point. Thus, the combined reward is higher if they both cooperate, but since they can’t trust their opponent it is only logical to defect and guarantee at least 1 point.
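To make the payoff structure concrete, here is a minimal sketch in Python; the point values are the reward version described above:

```python
# Reward version of the one-shot prisoner's dilemma: mutual cooperation
# pays 3, mutual defection 1, a lone defector 5, and a lone cooperator 0.
PAYOFF = {  # (my move, opponent's move) -> my points
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

# Whatever the opponent does, defecting pays strictly more,
# which is why defection is the dominant strategy in a single round:
for opp in ("C", "D"):
    assert PAYOFF[("D", opp)] > PAYOFF[("C", opp)]
```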
The prisoner’s dilemma changes if you play the game repeatedly, because you can now adjust to your opponent and it is not immediately obvious what the best strategy is. Robert Axelrod brought the IPD to public attention when he organized a tournament three decades ago. The results are published in his 1984 book The Evolution of Cooperation. I first learned about the results in Douglas Hofstadter’s Metamagical Themas column in Scientific American in the early 1980s. Axelrod invited a number of game theorists to submit strategies to play IPD, and the winner, submitted by Anatol Rapoport, was called tit-for-tat: always cooperate first, and then do whatever your opponent did in the previous round. Since this is a cooperative strategy with retribution, it has been held up ever since as an example of how cooperation could evolve. Press and Dyson now show that you can win by being nasty. Details of the calculations are below the fold.
Their analysis considers the IPD to be a probabilistic Markov chain, where the probability for a player to cooperate is determined by what was played by both players in the previous round. They also proved that having a long memory doesn’t help, so you don’t need to keep track of what happened prior to the last round to play their strategy. Consider two players X and Y. Let C stand for cooperate and D stand for defect. For each round, there are only four possible plays {CC, CD, DC, DD}, where CD means X cooperated and Y defected and so forth. From now on, I will always order the states in this way, i.e. X comes first. The strategy for player X is completely specified by the conditional probabilities to cooperate given the previous state, which are written as $p = (p_1, p_2, p_3, p_4)$, e.g. $p_2$ is the probability that X cooperates given that the previous round was CD. In somewhat confusing but logical notation, the probabilities for Y to cooperate given the exact same states are written as $(q_1, q_3, q_2, q_4)$; note that 2 and 3 are reversed compared to X. This is so that index 2 always means you cooperated and your opponent defected, and vice versa for 3. The payout to X for each of the four states is $S_X = (R, S, T, P)$ while the payout for Y is $S_Y = (R, T, S, P)$, with $T > R > P > S$ (in the example above, $(T, R, P, S) = (5, 3, 1, 0)$). Getting the notation straight is important for understanding the paper.
The conditional probabilities allow us to predict the play of both players in the next round given the state of the current round. For example, suppose both cooperate in round 1. The probability in round 2 that they both cooperate is $p_1 q_1$, that X cooperates and Y defects is $p_1(1 - q_1)$, that X defects and Y cooperates is $(1 - p_1) q_1$, and that both defect is $(1 - p_1)(1 - q_1)$. We can similarly calculate the four probabilities for the four states given that X cooperates and Y defects, X defects and Y cooperates, or both defect. There are sixteen transition probabilities in total that map the four states to four states. This is usually expressed as a 4 by 4 transition matrix. In matrix form, the probabilities of the states for subsequent rounds are found by repeated matrix multiplication by the transition matrix. Because all the elements of the transition matrix are positive (for strategies strictly between zero and one) and the rows sum to one, one eigenvalue is equal to 1 and the others are less than one in magnitude. Hence, repeated matrix multiplication will converge to a stationary distribution starting from any initial condition. The average payouts to the players are then given by the dot product of the stationary probabilities of the states with the payout vectors.
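As a sanity check, the iteration above can be sketched in Python with numpy; the particular strategy vectors below are made up purely for illustration:

```python
import numpy as np

# Memory-one strategies: probability of cooperating after each previous
# state {CC, CD, DC, DD}. These particular numbers are just illustrative.
p = np.array([0.9, 0.2, 0.8, 0.1])   # player X, in X's state ordering
q = np.array([0.8, 0.3, 0.7, 0.2])   # player Y, in Y's own ordering
qx = q[[0, 2, 1, 3]]                 # Y's cooperation probs in X's ordering

# 4x4 transition matrix over the states {CC, CD, DC, DD}
M = np.column_stack([p * qx, p * (1 - qx), (1 - p) * qx, (1 - p) * (1 - qx)])

# Repeated multiplication converges to the stationary distribution
v = np.full(4, 0.25)
for _ in range(1000):
    v = v @ M

S_X = np.array([3, 0, 5, 1])   # payouts (R, S, T, P) to X
S_Y = np.array([3, 5, 0, 1])   # payouts (R, T, S, P) to Y
print("stationary:", v, "payoffs:", v @ S_X, v @ S_Y)
```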
The convention of Markov chains is to multiply on the right, so the stationary probabilities of the states (i.e. the stationary distribution) are given by the eigenvalue equation

$$v^T M = v^T$$

In other words, the stationary distribution is the normalized left eigenvector of the transition matrix with eigenvalue 1. The transition matrix is

$$M = \begin{pmatrix} p_1 q_1 & p_1(1-q_1) & (1-p_1)q_1 & (1-p_1)(1-q_1) \\ p_2 q_3 & p_2(1-q_3) & (1-p_2)q_3 & (1-p_2)(1-q_3) \\ p_3 q_2 & p_3(1-q_2) & (1-p_3)q_2 & (1-p_3)(1-q_2) \\ p_4 q_4 & p_4(1-q_4) & (1-p_4)q_4 & (1-p_4)(1-q_4) \end{pmatrix}$$
If we let $M' = M - I$, the eigenvalue equation becomes

$$v^T M' = 0$$

and $M'$ satisfies the condition $\det M' = 0$. Here is where some knowledge of linear algebra comes in handy. A determinant formula related to Cramer’s rule is given by

$$\mathrm{Adj}(M')\, M' = \det(M')\, I$$

where $\mathrm{Adj}(M')$ is the adjugate matrix, which is the transpose of the cofactor matrix. (Press and Dyson call it the “matrix of minors” from high school algebra, but it could be the transpose of that depending on which high school you went to.) The cofactor matrix $C$ is given by the matrix of determinants of the matrix minors, i.e. element $C_{ij}$ is given by the determinant of the submatrix obtained by deleting row $i$ and column $j$, with a positive sign if $i + j$ is even and a negative sign if odd.
Press and Dyson then make the observation that since $\det M' = 0$, we have $\mathrm{Adj}(M')\, M' = 0$, and so the rows of $\mathrm{Adj}(M')$ must be proportional to $v$. (They don’t mention this, but it is true as long as the left null space of $M'$ is one dimensional, which holds because $M'$ has rank 3.) Hence, the dot product of $v$ with any vector $f$, such as a payout vector, is proportional to the dot product of any row of $\mathrm{Adj}(M')$ with $f$. If we use the fourth row of $\mathrm{Adj}(M')$, then $v \cdot f \propto \sum_j C_{j4} f_j$, where the $C_{j4}$ are the cofactors of the fourth column of $M'$. Recall that the formula for the determinant of any matrix can be expanded in terms of the elements of any row or column multiplied by their corresponding cofactors. Thus we can rewrite $\sum_j C_{j4} f_j$ as the determinant of the matrix $M'$ but with the fourth column replaced by $f$:

$$v \cdot f \propto \det \begin{pmatrix} p_1 q_1 - 1 & p_1(1-q_1) & (1-p_1)q_1 & f_1 \\ p_2 q_3 & p_2(1-q_3) - 1 & (1-p_2)q_3 & f_2 \\ p_3 q_2 & p_3(1-q_2) & (1-p_3)q_2 - 1 & f_3 \\ p_4 q_4 & p_4(1-q_4) & (1-p_4)q_4 & f_4 \end{pmatrix}$$

The determinant is unchanged if you add the first column to the second and third columns, so

$$v \cdot f \propto D(p, q, f) \equiv \det \begin{pmatrix} p_1 q_1 - 1 & p_1 - 1 & q_1 - 1 & f_1 \\ p_2 q_3 & p_2 - 1 & q_3 & f_2 \\ p_3 q_2 & p_3 & q_2 - 1 & f_3 \\ p_4 q_4 & p_4 & q_4 & f_4 \end{pmatrix}$$

and we assign the middle two columns to $\tilde p = (p_1 - 1, p_2 - 1, p_3, p_4)$ and $\tilde q = (q_1 - 1, q_3, q_2 - 1, q_4)$.
Notice that the second column depends only on $p$ and the third only on $q$. This means that either X or Y can unilaterally choose a strategy (i.e. select a set of conditional probabilities) to set $D(p, q, f)$ to zero for a given $f$. Finally, $v$ needs to be normalized so that its components sum to one, which amounts to dividing by $D(p, q, \mathbf{1})$, where $\mathbf{1} = (1, 1, 1, 1)$.
The average payouts are given by

$$s_X = \frac{v \cdot S_X}{v \cdot \mathbf{1}} = \frac{D(p, q, S_X)}{D(p, q, \mathbf{1})}, \qquad s_Y = \frac{v \cdot S_Y}{v \cdot \mathbf{1}} = \frac{D(p, q, S_Y)}{D(p, q, \mathbf{1})}$$

Since $D(p, q, f)$ is linear in $f$, we can set $f = \alpha S_X + \beta S_Y + \gamma \mathbf{1}$ and get

$$\alpha s_X + \beta s_Y + \gamma = \frac{D(p, q, \alpha S_X + \beta S_Y + \gamma \mathbf{1})}{D(p, q, \mathbf{1})}$$
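The determinant formula can be checked numerically. A sketch, assuming the conventional payoffs $(T, R, P, S) = (5, 3, 1, 0)$ and randomly chosen strategies:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 4)     # X's strategy (made up at random)
q = rng.uniform(0.05, 0.95, 4)     # Y's strategy, in Y's own ordering
qx = q[[0, 2, 1, 3]]               # Y's cooperation probs in X's ordering

def D(p, qx, f):
    """Press-Dyson determinant: column 1 is the first column of M' = M - I,
    column 2 depends only on p, column 3 only on q, column 4 is f."""
    return np.linalg.det(np.column_stack([
        p * qx - np.array([1.0, 0.0, 0.0, 0.0]),
        np.array([p[0] - 1, p[1] - 1, p[2], p[3]]),
        np.array([qx[0] - 1, qx[1], qx[2] - 1, qx[3]]),
        np.asarray(f, dtype=float),
    ]))

# Stationary distribution by repeated multiplication
M = np.column_stack([p * qx, p * (1 - qx), (1 - p) * qx, (1 - p) * (1 - qx)])
v = np.full(4, 0.25)
for _ in range(2000):
    v = v @ M

S_X = np.array([3, 0, 5, 1])       # (R, S, T, P)
# The two ways of computing X's average payout agree:
assert np.isclose(v @ S_X, D(p, qx, S_X) / D(p, qx, np.ones(4)))
```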
If $\tilde p = \alpha S_X + \beta S_Y + \gamma \mathbf{1}$ (or $\tilde q = \alpha S_X + \beta S_Y + \gamma \mathbf{1}$), then the numerator determinant has two equal columns and vanishes, so there is a linear relationship between the scores given by

$$\alpha s_X + \beta s_Y + \gamma = 0$$

Press and Dyson call these zero-determinant (ZD) strategies. The amazing thing is that a player can unilaterally determine the other player’s score with a ZD strategy. For example, if X sets $\alpha = 0$ (i.e. sets $\tilde p = \beta S_Y + \gamma \mathbf{1}$) then Y’s score is forced to be $s_Y = -\gamma/\beta$, i.e. X unilaterally sets Y’s score. The probabilities for the strategy are given by

$$p_1 = 1 + \beta R + \gamma, \quad p_2 = 1 + \beta T + \gamma, \quad p_3 = \beta S + \gamma, \quad p_4 = \beta P + \gamma$$
The question then is whether parameters can be chosen such that all the probabilities are bounded between zero and one. The first thing to do is to eliminate $\beta$ and $\gamma$ and solve for $p_2$ and $p_3$ in terms of $p_1$ and $p_4$. A little algebra gives

$$p_2 = \frac{p_1 (T - P) - (1 + p_4)(T - R)}{R - P}$$

and

$$p_3 = \frac{(1 - p_1)(P - S) + p_4 (R - S)}{R - P}$$
Solving for $\beta$ and $\gamma$ in terms of $p_1$ and $p_4$ and substituting into $s_Y = -\gamma/\beta$ gives the average payout

$$s_Y = \frac{(1 - p_1) P + p_4 R}{1 - p_1 + p_4}$$

These expressions show that a viable solution is possible when $p_1$ is near 1 while $p_4$ is near zero, and that all possible scores between $P$ and $R$ can be forced on Y. However, a similar calculation shows that X cannot arbitrarily set his own score.
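Here is a sketch of X forcing Y’s score using the formulas above with the conventional payoffs; the choices $p_1 = 0.99$ and $p_4 = 0.01$ are arbitrary, and with them Y is forced to $s_Y = 2$ no matter what memory-one strategy Y plays:

```python
import numpy as np

R, S, T, P = 3, 0, 5, 1
p1, p4 = 0.99, 0.01    # free parameters, chosen near 1 and 0 as noted above
p2 = (p1 * (T - P) - (1 + p4) * (T - R)) / (R - P)
p3 = ((1 - p1) * (P - S) + p4 * (R - S)) / (R - P)
p = np.array([p1, p2, p3, p4])
assert np.all((p >= 0) & (p <= 1))   # a valid set of probabilities

forced = ((1 - p1) * P + p4 * R) / (1 - p1 + p4)   # predicted s_Y (= 2 here)

# Y's long-run score equals `forced` regardless of Y's memory-one strategy:
S_Y = np.array([R, T, S, P])
rng = np.random.default_rng(1)
for _ in range(5):
    q = rng.uniform(0.05, 0.95, 4)   # a random strategy for Y
    qx = q[[0, 2, 1, 3]]
    M = np.column_stack([p * qx, p * (1 - qx),
                         (1 - p) * qx, (1 - p) * (1 - qx)])
    v = np.full(4, 0.25)
    for _ in range(20000):
        v = v @ M
    assert abs(v @ S_Y - forced) < 1e-6
```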
The extortion strategy for X is given by choosing

$$\tilde p = \phi \left[ (S_X - P \mathbf{1}) - \chi (S_Y - P \mathbf{1}) \right]$$

for parameters $\phi > 0$ and $\chi \ge 1$. Under this condition, Press and Dyson show that Y’s score is maximized only if Y always cooperates, i.e. $q = (1, 1, 1, 1)$, but X’s score is always greater than Y’s score. If Y decides to play the same strategy as X then they both end up defecting all the time, since $p_4 = q_4 = 0$.
$\chi$ is the extortion parameter. If it is greater than one then X’s maximum score is always greater than the mutual cooperation payout. When $\chi = 1$ (with $\phi = 1/(T - S)$), X’s strategy reduces to $p = (1, 0, 1, 0)$, which is tit-for-tat, i.e. cooperate only when your opponent cooperated. Hence, the extortion strategy makes the IPD resemble the ultimatum game, where one player proposes a split of a hundred dollars and the other player must choose whether to accept. If she accepts then both get their shares, and if she doesn’t then no one gets anything. Logically it always makes sense to accept, although in practice most people will decline if the split is far enough from 50-50.
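A quick numerical check of the extortion relation $s_X - P = \chi(s_Y - P)$; the values $\chi = 3$ and $\phi = 0.05$ below are arbitrary choices that keep all the probabilities in $[0, 1]$ for the conventional payoffs:

```python
import numpy as np

R, S, T, P = 3, 0, 5, 1
S_X = np.array([R, S, T, P])
S_Y = np.array([R, T, S, P])

chi, phi = 3.0, 0.05                        # extortion factor and scale
pt = phi * ((S_X - P) - chi * (S_Y - P))    # p-tilde
p = pt + np.array([1.0, 1.0, 0.0, 0.0])     # undo the shift in entries 1, 2
assert np.all((p >= 0) & (p <= 1))          # phi is small enough to be valid

rng = np.random.default_rng(2)
for _ in range(5):
    q = rng.uniform(0.05, 0.95, 4)          # an arbitrary strategy for Y
    qx = q[[0, 2, 1, 3]]
    M = np.column_stack([p * qx, p * (1 - qx),
                         (1 - p) * qx, (1 - p) * (1 - qx)])
    v = np.full(4, 0.25)
    for _ in range(20000):
        v = v @ M
    s_X, s_Y = v @ S_X, v @ S_Y
    # X's surplus over P is always chi times Y's surplus, and X comes out ahead
    assert np.isclose(s_X - P, chi * (s_Y - P))
    assert s_X > s_Y
```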
In a recent preprint (see here), Adami and Hintze claim that the ZD strategy is not evolutionarily stable, although I haven’t read the paper yet.