A new strategy for the iterated prisoner’s dilemma game

The game theory world was stunned recently when Bill Press and Freeman Dyson found a new strategy for the iterated prisoner’s dilemma (IPD) game. They show how you can extort an opponent so that the only way the opponent can maximize their own payoff is to hand you an even higher one. The paper, published in PNAS (link here) with a commentary (link here), is so clever that I thought it would be worthwhile to write a pedagogical summary for those who are unfamiliar with some of the methods and concepts they use. This paper shows how knowing a little bit of linear algebra can go a long way toward exploring deep ideas.

In the classic prisoner’s dilemma, two prisoners are interrogated separately. Each has two choices. If both stay silent (cooperate), they each get a year in prison. If one confesses (defects) while the other stays silent, the defector is released while the cooperator gets 5 years. If both defect, they both get 3 years in prison. Hence, even though the best joint outcome is for both to cooperate, the only logical thing to do is to defect. You can watch this played out on the British television show Golden Balls (see example here). Usually the payout is expressed as a reward: if both cooperate they each get 3 points; if one defects while the other cooperates, the defector gets 5 points and the cooperator gets zero; and if both defect they get 1 point each. Thus, the combined reward is higher if both cooperate, but since neither can trust the other it is only logical to defect and guarantee at least 1 point.

The prisoner’s dilemma changes if you play the game repeatedly, because you can now adjust to your opponent and it is not immediately obvious what the best strategy is. Robert Axelrod brought the IPD to public attention when he organized a tournament three decades ago. The results are published in his 1984 book The Evolution of Cooperation. I first learned about them in Douglas Hofstadter’s Metamagical Themas column in Scientific American in the early 1980s. Axelrod invited a number of game theorists to submit strategies to play IPD, and the winner, submitted by Anatol Rapoport, was called tit-for-tat: always cooperate first and then do whatever your opponent did in the previous round. Since this is a cooperative strategy with retribution, people have pointed to the result ever since as an example of how cooperation could evolve. Press and Dyson now show that you can win by being nasty. Details of the calculations are below the fold.

Their analysis considers the IPD to be a probabilistic Markov chain, where the probability for a player to cooperate is determined by what was played by both players in the previous round. They also proved that having a long memory doesn’t help so you don’t need to keep track of what happened prior to the last round to play their strategy. Consider two players X and Y. Let C stand for cooperate and D stand for defect. For each round, there are only four possible plays {CC,CD,DC,DD}, where CD means X cooperated and Y defected and so forth. From now on, I will always order the states in this way, i.e. X comes first. The strategy for player X is completely specified by the conditional probability to cooperate

\{P(X=C | C C), P(X=C|C D), P(X=C |D C), P(X=C|DD)\}

which is written as \{p_1, p_2, p_3, p_4\}. In somewhat confusing but logical notation, the probabilities for Y to cooperate given the same states are written as \{q_1, q_3, q_2, q_4\}; note that 2 and 3 are reversed compared to X. This is so that index 2 always means you cooperated while your opponent defected, and vice versa for index 3. The payout to X for each of the four states is S_X=\{R,S,T,P\} while the payout to Y is S_Y=\{R,T,S,P\}, with T>R>P>S. Getting the notation straight is important for understanding the paper.

The conditional probabilities allow us to predict the play of both players in the next round given the state of the current round. For example, suppose both cooperate in round 1. The probability in round 2 that they both cooperate is p_1 q_1, that X cooperates and Y defects is p_1(1-q_1), that X defects and Y cooperates is (1-p_1)q_1, and that both defect is (1-p_1)(1-q_1). We can similarly calculate the four probabilities for the next round given that X cooperates and Y defects, X defects and Y cooperates, or both defect. There are sixteen transition probabilities in total that map the four states to four states, usually expressed as a 4 by 4 transition matrix. In matrix form, the probabilities of the states for subsequent rounds are found by repeated matrix multiplication of the transition matrix. Because each row of the transition matrix is a probability distribution (nonnegative entries summing to one), one eigenvalue is equal to 1, and when the entries are strictly positive the others are less than one in magnitude. Hence, repeated matrix multiplication will converge to a stationary state from any initial condition. The average payouts to the players are then given by the dot product of the stationary probability of the states with the payout vectors.
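This round-by-round bookkeeping is easy to check numerically. Here is a minimal sketch in Python with NumPy (the function names are mine, not from the paper) that builds the transition matrix for given p and q, iterates it to the stationary distribution, and computes the average payouts using the reward values from above:

```python
import numpy as np

def transition_matrix(p, q):
    """4x4 Markov matrix over the states {CC, CD, DC, DD} (X's play first).

    p = (p1, p2, p3, p4) are X's cooperation probabilities given the last
    state; q likewise for Y, remembering the 2-3 index swap: in state CD
    (X cooperated, Y defected) Y consults q3, and in state DC it consults q2.
    """
    q_by_state = [q[0], q[2], q[1], q[3]]
    return np.array([[pc*qc, pc*(1-qc), (1-pc)*qc, (1-pc)*(1-qc)]
                     for pc, qc in zip(p, q_by_state)])

def stationary(M, n_iter=5000):
    """Stationary distribution via repeated left-multiplication."""
    v = np.full(4, 0.25)
    for _ in range(n_iter):
        v = v @ M
    return v

# Example: X plays tit-for-tat; Y cooperates with probability 0.8 regardless.
p = (1.0, 0.0, 1.0, 0.0)
q = (0.8, 0.8, 0.8, 0.8)
M = transition_matrix(p, q)
v = stationary(M)
S_X = np.array([3, 0, 5, 1])   # X's payouts in {CC, CD, DC, DD}
S_Y = np.array([3, 5, 0, 1])   # Y's payouts
print(v, v @ S_X, v @ S_Y)     # stationary distribution and average payouts
```

Note that tit-for-tat yields equal average scores here (2.76 each), consistent with its reputation for fairness.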

The convention of Markov chains is to multiply on the right so that the stationary probabilities of the states v (i.e. probability distribution) is given by the eigenvalue equation

v^T M = v^T

In other words, the stationary distribution is the normalized left eigenvector of the transition matrix. The transition matrix is

M=\left [\begin{array}{cccc} p_1q_1 & p_1(1-q_1) & (1-p_1)q_1 & (1-p_1)(1-q_1)\\  p_2q_3 & p_2(1-q_3) & (1-p_2)q_3 & (1-p_2)(1-q_3) \\    p_3q_2 & p_3(1-q_2) & (1-p_3)q_2 & (1-p_3)(1-q_2) \\    p_4q_4 & p_4(1-q_4) & (1-p_4)q_4 & (1-p_4)(1-q_4)\end{array}\right ]

If we let M'=M-I, the eigenvalue equation becomes v^T M'=0 and it satisfies the condition \det(M-I)\equiv \det(M')=0. Here is where some knowledge of linear algebra comes in handy. A formula for the determinant called Cramer’s rule is given by

\det(M')\, I=Adj(M')M'

where Adj(M') is the adjugate matrix, which is the transpose of the cofactor matrix. (Press and Dyson call it the “matrix of minors” from high school algebra, but it could be the transpose of that, depending on which high school you went to.) The cofactor matrix C is the matrix of determinants of the matrix minors, i.e. element C_{ij} is given by the determinant of the submatrix obtained by deleting row i and column j, with a positive sign if i+j is even and a negative sign if odd.

Press and Dyson then make the observation that since v^T M'=0 and Adj(M')M'=0, the rows of Adj(M') must be proportional to v. (They don’t mention this, but it is true as long as the left null space of M' is one dimensional, which holds because M' has rank 3.) Hence, the dot product of v with any vector f, such as a payout vector, is proportional to the dot product of any row of Adj(M') with f. If we use the fourth row of Adj(M') then v\cdot f= C_{14} f_1 + C_{24} f_2 + C_{34} f_3 + C_{44} f_4, where the C_{i4} are the signed cofactors defined above. Recall that the determinant of any matrix can be expanded in terms of the elements of any row or column multiplied by their corresponding cofactors. Thus we can rewrite v\cdot f as the determinant of the matrix M' but with the fourth column replaced by f:

v\cdot f=\det\left [\begin{array}{cccc} p_1q_1 -1 & p_1(1-q_1) & (1-p_1)q_1 & f_1\\  p_2q_3 & p_2(1-q_3) - 1 & (1-p_2)q_3 & f_2\\    p_3q_2 & p_3(1-q_2) & (1-p_3)q_2 - 1 &f_3\\    p_4q_4 & p_4(1-q_4) & (1-p_4)q_4 & f_4\end{array}\right ]

The determinant is unchanged if you add the first column to the second and third columns so

v\cdot f=\det\left [\begin{array}{cccc} p_1q_1 -1 & p_1-1 & q_1 -1& f_1\\  p_2q_3 & p_2 - 1 &q_3 & f_2\\    p_3q_2 & p_3 & q_2 - 1 &f_3\\    p_4q_4 & p_4 & q_4 & f_4\end{array}\right ]\equiv D(p, q, f)

and we assign the middle two columns to

\tilde{p} = \{p_1-1,p_2-1,p_3,p_4\}

\tilde{q} = \{q_1-1,q_3,q_2-1,q_4\}.

Notice that the second column only depends on p and the third only depends on q. This means that either X or Y can choose a strategy (i.e. select a set of conditional probabilities) to set D to zero. Finally, v needs to be normalized such that 1=\sum v_i = v\cdot \mathbf{1}=D(p,q,\mathbf{1}).
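This determinant identity is easy to verify numerically. The sketch below (my own check, not from the paper) builds the determinant D(p,q,f) directly from the matrix after the column operations, and compares D(p,q,S_X)/D(p,q,\mathbf{1}) against v\cdot S_X computed from the left eigenvector of the transition matrix, for a randomly chosen pair of strategies:

```python
import numpy as np

def D(p, q, f):
    """Press-Dyson determinant: M - I with column operations applied
    and the fourth column replaced by f."""
    p1, p2, p3, p4 = p
    q1, q2, q3, q4 = q
    A = np.array([
        [p1*q1 - 1, p1 - 1, q1 - 1, f[0]],
        [p2*q3,     p2 - 1, q3,     f[1]],
        [p3*q2,     p3,     q2 - 1, f[2]],
        [p4*q4,     p4,     q4,     f[3]],
    ])
    return np.linalg.det(A)

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 4)   # random interior strategies
q = rng.uniform(0.05, 0.95, 4)

# Stationary distribution computed directly as a left eigenvector
M = np.array([[pc*qc, pc*(1-qc), (1-pc)*qc, (1-pc)*(1-qc)]
              for pc, qc in zip(p, [q[0], q[2], q[1], q[3]])])
w, V = np.linalg.eig(M.T)              # left eigenvectors of M
v = np.real(V[:, np.argmax(np.real(w))])
v = v / v.sum()                        # normalize so the entries sum to 1

S_X = np.array([3, 0, 5, 1])
# v . S_X should equal D(p,q,S_X) / D(p,q,1); the unknown constant cancels
print(v @ S_X, D(p, q, S_X) / D(p, q, np.ones(4)))
```

The ratio of determinants removes the unknown proportionality constant, which is why the normalization by D(p,q,\mathbf{1}) appears in the payout formulas below.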

The average payouts are given by

s_X=v\cdot S_X = D(p,q,S_X)/D(p,q,\mathbf{1})

s_Y=v\cdot S_Y= D(p,q,S_Y)/D(p,q,\mathbf{1})

Since D is linear in f, we can set f=\alpha S_X+ \beta S_Y +\gamma and get

\alpha s_X +\beta s_Y+\gamma \propto D(p,q,\alpha S_X+ \beta S_Y +\gamma).

If  D(p,q,\alpha S_X+ \beta S_Y +\gamma)=0 then there is a linear relationship between the scores given by \alpha s_X +\beta s_Y+\gamma=0.   Press and Dyson call these zero-determinant (ZD) strategies. The amazing thing is that a player can unilaterally determine the other player’s score with a ZD strategy. For example, if X sets \tilde{p}=\beta S_Y+\gamma (i.e. set \alpha=0) then Y’s score is forced to be  s_Y = -\gamma /\beta, i.e. X unilaterally sets Y’s score. The probabilities for the strategy are given by

\begin{array}{l} p_1= 1+\beta R+\gamma\\p_2=1+\beta T +\gamma\\p_3=\beta S +\gamma\\p_4=\beta P +\gamma\end{array}

The question then is whether the parameters can be chosen such that all the probabilities are bounded between zero and one. The first thing to do is to eliminate \beta and \gamma and solve for p_2 and p_3 in terms of p_1 and p_4. A little algebra gives

p_2 = \frac{p_1(T-P)-(1+p_4)(T-R)}{R-P} and p_3=\frac{(1-p_1)(P-S)+p_4(R-S)}{R-P}

Solving for \beta and \gamma and substituting into s_Y=-\gamma/\beta gives the average payout

s_Y = \frac{(1-p_1)P+p_4 R}{(1-p_1)+p_4}

This expression shows that a viable solution is possible when p_1 is near 1 while p_4 is near zero, and that any score between P and R can be forced on Y. However, a similar calculation shows that X cannot arbitrarily set his own score.
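As a concrete check (my own numerical experiment; the particular \beta and \gamma are just one workable choice), the snippet below builds such a strategy for the standard payoffs, confirms the formulas for p_2 and p_3, and verifies that Y’s average score comes out at -\gamma/\beta no matter what Y plays:

```python
import numpy as np

R, S, T, P = 3, 0, 5, 1
S_Y = np.array([R, T, S, P])          # Y's payouts in {CC, CD, DC, DD}

beta, gamma = -0.1, 0.2               # one workable choice (beta must be < 0)
p = np.array([1 + beta*R + gamma, 1 + beta*T + gamma,
              beta*S + gamma,     beta*P + gamma])
assert np.all((0 <= p) & (p <= 1))    # p = (0.9, 0.7, 0.2, 0.1): valid

# p2 and p3 agree with the formulas in terms of p1 and p4
assert np.isclose(p[1], (p[0]*(T-P) - (1 + p[3])*(T-R)) / (R-P))
assert np.isclose(p[2], ((1 - p[0])*(P-S) + p[3]*(R-S)) / (R-P))

forced = -gamma / beta                # X forces s_Y = 2.0

def avg_score(p, q, n=20_000):
    """Y's average payout in the stationary state, by power iteration."""
    M = np.array([[pc*qc, pc*(1-qc), (1-pc)*qc, (1-pc)*(1-qc)]
                  for pc, qc in zip(p, [q[0], q[2], q[1], q[3]])])
    v = np.full(4, 0.25)
    for _ in range(n):
        v = v @ M
    return v @ S_Y

rng = np.random.default_rng(1)
for _ in range(3):
    q = rng.uniform(0.05, 0.95, 4)    # an arbitrary strategy for Y
    print(avg_score(p, q))            # ~2.0 every time
```

With these values, ((1-p_1)P + p_4 R)/((1-p_1)+p_4) = (0.1 + 0.3)/0.2 = 2, exactly the forced score.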

The extortion strategy for X is given by choosing \tilde{p} = \phi[(S_X-P)-\chi(S_Y-P)] for parameters \phi and \chi. Under this condition, Press and Dyson show that Y’s score is maximized only if Y always cooperates, i.e. q=\{1,1,1,1\}, but X’s score is always greater than Y’s score. If Y decides to play the same strategy as X then they both end up defecting all the time, since p_4=q_4=0. \chi is the extortion parameter: if it is greater than one then X’s maximum score is always greater than the mutual cooperation payoff R. When \chi = 1, X’s strategy reduces to \{1,0,1,0\}, which is tit-for-tat, i.e. cooperate only when your opponent cooperated in the previous round. Hence, the extortion strategy makes IPD resemble the ultimatum game, where one player proposes a split of a hundred dollars and the other player chooses whether to accept. If she accepts then both get their shares and if she doesn’t then no one gets anything. Logically it always makes sense to accept, although in practice most people will decline if the split is too far from 50-50.
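The extortion relation pins the two scores to the line s_X - P = \chi(s_Y - P). In the sketch below (again my own check; \chi=2 and \phi=0.1 are arbitrary choices that keep the probabilities in [0,1]), the scores against different opponents all fall on that line, and against a fully cooperative Y the extorter does strictly better:

```python
import numpy as np

R, S, T, P = 3, 0, 5, 1
S_X = np.array([R, S, T, P])
S_Y = np.array([R, T, S, P])

chi, phi = 2.0, 0.1                        # extortion factor; phi kept small
p_tilde = phi * ((S_X - P) - chi * (S_Y - P))
p = p_tilde + np.array([1, 1, 0, 0])       # undo the tilde shift on p1, p2
assert np.all((0 <= p) & (p <= 1))         # p = (0.8, 0.1, 0.6, 0.0)

def scores(p, q, n=20_000):
    """Both players' average payouts in the stationary state."""
    M = np.array([[pc*qc, pc*(1-qc), (1-pc)*qc, (1-pc)*(1-qc)]
                  for pc, qc in zip(p, [q[0], q[2], q[1], q[3]])])
    v = np.full(4, 0.25)
    for _ in range(n):
        v = v @ M
    return v @ S_X, v @ S_Y

for q in [np.full(4, 1.0),                  # Y always cooperates
          np.array([0.7, 0.2, 0.9, 0.4])]:  # some arbitrary Y
    sx, sy = scores(p, q)
    print(sx, sy, sx - P - chi * (sy - P))  # last number ~0: on the line
```

Against the fully cooperative Y this gives s_X = 3.5 and s_Y = 2.25: both exceed the mutual-defection payoff P=1, but X pockets twice Y’s surplus over P.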

In a recent preprint (see here), Adami and Hintze claim that the ZD strategy is not evolutionarily stable, although I haven’t read the paper yet.
