
Evolutionary Learning in Strategic Form Games

By Suraj Sridhar, Chief Content and Partnerships Officer

Red Dead Redemption 2. A duel in Saint Denis.

Two figures, silhouetted black against the wide orange of a setting sun, shimmering in the evening haze. Their gazes lock unblinking, leather hats suppled in the midwestern heat, one bearing stars, the other bearing holes; their hips holster the cold steel that equalises law and outlaw, that buries good and evil alike.


Two simultaneous draws! A coincident crescendo; a cacophonous climax! No first, no second, but together now: shoot or don’t? Live or die? From boarded windows peering eyes widen, children’s eyes, watching, learning, wondering: who should prevail, sheriff or bandit? Freedom or federalism? Order or chaos?


One hammer falls. One does not.

One trusted, one betrayed.

An immemorial showdown between two players, the standoff is the ultimate gamble. Time freezes, the future is uncertain, your enemy is unpredictable. In game theory, we have immortalised it in the legend of the strategic form game, using it to describe standoffs everywhere from market competition to nuclear apocalypse.

Khrushchev and Kennedy at the height of the nuclear arms race, 1963

However, the tale of the standoff tells us we’ve always missed one thing in the games we design: the concept of death itself. Losers in the real world don’t survive, while winners simply expropriate more resources for themselves. In our standoff, one gunslinger gets shot, while the other gets better at shooting.

And if death is missing, what of birth — by which either dynasties persist or politics moves on? A new mother cradles in her arms the future of knowledge, genes, and ideas; a whole evolutionary order is the context in which oligopolies collude or betray, in which Russia or India wins at chess, in which we fix or fumble climate change. Yet neither the cradle nor the grave are anywhere to be found in our strategic form games.

This is worrying. If we do not encode players who can evolve and adapt through the forces of birth and death, we cannot hope to show how players learn the optimal strategies in the real world, where there are hundreds of imperfect people, thousands of creative tactics, and millions of ways to miscalculate what the optimal move is in a given scenario.

Take the Prisoner’s Dilemma for instance. It tells a tale of cooperation and exploitation, yet when we observe these behaviours in nature, markets, war, and politics, we speak of them in the context of survival — of which behaviours can lead to life, and which to death.

Figure 1. The one-shot Prisoner’s Dilemma game for two agents. Higher payoffs represent better outcomes.
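
To make the matrix concrete, here is one conventional payoff assignment, sketched in Python. The specific numbers (5, 3, 1, 0) are an assumption on my part; the figure above carries the values actually used.

```python
# Hypothetical Prisoner's Dilemma payoffs (temptation > reward >
# punishment > sucker); the figure carries the article's actual values.
# Each entry maps (my_move, their_move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation (reward)
    ("C", "D"): 0,  # I cooperate, they defect (sucker's payoff)
    ("D", "C"): 5,  # I defect, they cooperate (temptation)
    ("D", "D"): 1,  # mutual defection (punishment)
}

def play(move_1, move_2):
    """Return the payoff pair for one round of the game."""
    return PAYOFF[(move_1, move_2)], PAYOFF[(move_2, move_1)]
```

Any numbers satisfying the same ordering tell the same story: defection is individually tempting, but mutual cooperation beats mutual defection.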

Imagine for a moment that — like most animals, consumers, soldiers, and (unfortunately) many politicians — neither of the players has taken a Microeconomics course, and so neither knows the optimal way to play this game from the outset. However, let's give them a chance to learn: players can play this game multiple times. Can we build a model in which players learn the optimal strategy without being explicitly told what it is?

Building a Learning Model

For agents (players) to learn the optimal strategy across successive generations of the game, they need three things:

  1. A ledger of all the past games that have been played and their outcomes

  2. An instruction book on how to respond to different past outcomes

  3. A way to edit their instruction book based upon these past outcomes
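
The first two ingredients can be sketched as a small data structure. The `Agent` class and its field names below are hypothetical, not taken from the model's actual code; the third ingredient, editing the book, is what the rest of the article builds.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    # 2. Instruction book: maps a remembered history of past
    #    outcomes to this agent's next move ("C" or "D").
    instruction_book: dict
    # 1. Ledger: chronological record of past outcomes,
    #    stored as (my_move, their_move) pairs.
    ledger: list = field(default_factory=list)
    total_payoff: int = 0

    def next_move(self, history):
        """Look up the response the book prescribes for this history."""
        return self.instruction_book[history]
```

Note the `default_factory`: each agent gets its own fresh ledger rather than sharing one list, which matters once there are many agents.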

Figure 2. A typical ledger of past outcomes, colour coded according to the final payoff for each agent. Notice we will use double-entry bookkeeping.

Points 1 and 2 are easy to imagine; they appear as basic assumptions in game theory. Point 3 is trickier. What does it mean to edit an instruction book based on past information?

Here’s where our concepts of life and death can help us out. We know that if an agent consistently earns low payoffs, then their strategies are not an optimal response to their opponents’ strategies. Repeating their behaviour will probably lead us away from the optimal strategies and not towards them. Nature’s solution: kill this agent!

Conversely, if an agent consistently gets high payoffs, their strategies are actually quite good responses. Their instruction books are better adapted to the environment of the game and their opponents. So we keep them alive, since they hold the information we need to make agents’ strategies more optimal.

Just as the natural order favours the better-adapted organisms, here we will favour the better-adapted instruction books.

To conserve computing power, let’s restrict agents’ “memory” of the past outcomes to just the last three games that have been played.
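
Under this restriction, an instruction book is just a lookup table keyed by the last three joint outcomes. Here is a minimal sketch, assuming random initial instructions and a fixed set of three opening moves; the function and key names are hypothetical.

```python
import itertools
import random

MOVES = ("C", "D")
# The four possible joint outcomes of one game: (C,C), (C,D), (D,C), (D,D)
OUTCOMES = [(a, b) for a in MOVES for b in MOVES]

def random_instruction_book(rng):
    """Build a book with one instruction per possible three-game
    history, chosen at random, plus fixed instructions for the first
    three moves (played before a full history exists)."""
    book = {history: rng.choice(MOVES)
            for history in itertools.product(OUTCOMES, repeat=3)}
    book["opening"] = tuple(rng.choice(MOVES) for _ in range(3))
    return book

rng = random.Random(0)
book = random_instruction_book(rng)
# 4^3 = 64 full three-game histories, plus the three opening moves
```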

Figure 3. Two typical instruction books (for agents 1 and 2). Each remembered game ends in one of four outcomes, so, accounting for 3 initial choices, there will be 4³ + 3 = 67 different instructions for each agent.

Now, if we only have two agents and one of them dies after losing, our model isn’t very exciting. So let’s scale up to twenty agents total, where each agent plays every other, round-robin style.
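
The round-robin schedule itself is only a few lines. This sketch assumes a `play_match` helper (hypothetical, not shown here) that runs one repeated game between a pair of agents and returns their two payoff totals.

```python
import itertools

def round_robin(agents, play_match):
    """Every agent plays every other agent once; play_match(a, b)
    returns the (payoff_a, payoff_b) totals for one repeated game."""
    totals = [0] * len(agents)
    for i, j in itertools.combinations(range(len(agents)), 2):
        p_i, p_j = play_match(agents[i], agents[j])
        totals[i] += p_i
        totals[j] += p_j
    return totals
```

With twenty agents, that is 20 × 19 / 2 = 190 matches per generation.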

This is where it gets interesting. To make our model learn the optimal strategies, after several games have been played between all agents, agents with low total payoffs are removed from the model (killed), taking their instruction books with them. Successful agents, on the other hand, are rewarded with a brilliant prize: reproduction!

We can allow successful agents to “mate” with each other to produce an offspring agent, which will have an instruction book with a mix of strategies taken from each of their parents at random. In this way, instruction books are a sort of “genome” with each instruction being a sort of “gene” and just like in natural selection, we can expect the better-adapted genomes to outpopulate the worse-adapted ones.
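
Selection and mating together might look like the following sketch, where each instruction book is a plain dict of history → move, the mixing rule is uniform crossover (each instruction inherited from a random parent), and the survival cutoff is an assumed parameter.

```python
import random

def crossover(book_a, book_b, rng):
    """Offspring instruction book: each instruction ("gene") is
    inherited from one parent chosen at random (uniform crossover)."""
    return {history: rng.choice((book_a[history], book_b[history]))
            for history in book_a}

def select_and_breed(books, scores, rng, n_survivors=10):
    """Kill the low scorers, then refill the population with
    offspring of randomly paired survivors."""
    ranked = sorted(range(len(books)), key=lambda i: scores[i], reverse=True)
    survivors = [books[i] for i in ranked[:n_survivors]]
    while len(survivors) < len(books):
        mum, dad = rng.sample(survivors[:n_survivors], 2)
        survivors.append(crossover(mum, dad, rng))
    return survivors
```

The population size stays constant: every agent killed is replaced by exactly one newborn.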

Testing the Learning Model

Figure 4. The final learning model.

Let’s see what the first few generations of games and genomes look like.

Figure 5. Game history is only shown for the first 20 rounds in each generation for agents 0 and 1. Genomes shown for all agents for the first three moves and 27 other histories.

Not surprisingly, we see outcomes collapse to the Nash equilibrium (Defect, Defect) pretty quickly: within 15 generations of evolution.

Here are the final outcomes.

Figure 6. Outcomes after 30 generations of evolution. Abbreviations have been used to simplify display: M = (C,C); L = (C,D); W = (D,C); F = (D,D).

The real puzzle arises when you take a look at the genomes. By generation 30, you might find that some safe strategies have emerged, like tit-for-tat defection after (C,C); (C,C); (C,D), but also some risky ones, like exploiting trust by choosing to defect after three rounds of mutual cooperation. With nothing but a simple game and some selective strategy mixing, we have created agents that behave much like a human might in this scenario.

What determines which strategies emerge? Does it even matter for the outcomes? Should it?

Well — that’s up to you to find out, partner.

Extending the Model

You can find the code (in Python) for the model here.

On a final note, here are a few questions you might want to explore when playing with this model.

  1. What happens if we add random mutations to the genes every generation?

  2. Does behaviour always converge to the Nash equilibrium? What if we change the payoffs/the game itself?

  3. How does the network topology between agents on which the game is played affect outcomes?
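
As a starting point for the first question, a mutation step can be a few lines layered on top of the model: flip each instruction to the opposite move with some small probability. The rate below is an arbitrary assumption, and tuning it is part of the fun.

```python
import random

def mutate(book, rng, rate=0.01):
    """With probability `rate`, flip each instruction ("gene") in a
    history -> move instruction book to the opposite move."""
    flip = {"C": "D", "D": "C"}
    return {history: flip[move] if rng.random() < rate else move
            for history, move in book.items()}
```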




