A more general model of cooperation based on
reinforcement
learning: Alignment and Integration of the
Bush-Mosteller and the
Roth-Erev model*
Paper
to be presented at the Model To Model (M2M) Workshop, March 31st
- April 1st, Marseille,
France.
Andreas
Flache
Department
of Sociology, University of Groningen
Grote
Rozenstraat 31. 9712 TG Groningen, The
Netherlands
a.flache@ppsw.rug.nl
Michael W. Macy
Department of Sociology, Cornell University
Ithaca N.Y. 14853
mwm14@cornell.edu
Stochastic Collusion and the Power
Law of Learning:
Aligning and Integrating the Bush-Mosteller and the Roth-Erev Reinforcement Learning Models of Cooperation
Analytical
game theory has developed the Nash equilibrium as theoretical tool for the
analysis of cooperation and conflicts in interdependent decision making.
Indeterminacy and demanding rationality assumptions of the Nash equilibrium
have led cognitive game theorists to explore learning-theoretic models of
behavior. Two prominent examples are the Bush-Mosteller stochastic learning
model and the Roth-Erev payoff-matching model. We align and integrate the two
models as special cases of a General Reinforcement Learning Model. Both models
predict stochastic collusion as a
backward-looking solution to the problem of cooperation in social dilemmas,
based on a random walk into a self-reinforcing cooperative equilibrium. The
integration also uncovers hidden assumptions that constrain the generality of
the theoretical derivations. Specifically, Roth and Erev assume a “Power Law of
Learning” - the curious but plausible tendency for learning to
diminish with success and intensify with failure, which we call “fixation.” We
use computer simulation to explore the effects of fixation on stochastic collusion
in three social dilemma games. The analysis shows how the integration of
alternative models can uncover underlying principles and lead to a more general
theory.
Conflicts
as diverse as those between neighbors, friends, business associates, ethnic
groups, or states, often ensue from the same fundamental logic. On the one hand
the parties involved have common interests, such as avoidance of an all out
nuclear war, on the other hand they may fall prey to the temptation to pursue individual
agendas at the expense of the commons. To explore how and under what conditions
such relationships may proceed cooperatively rather than conflictuous, game
theory formalizes the problem of cooperation. At the
most elementary level, the theory represents relationships as a mixed-motive
two-person game with two choices, cooperate and defect. These choices intersect
at four possible outcomes, abbreviated as CC,
CD, DD, and DC. Each outcome
has an associated payoff: R (reward),
S (sucker), P (punishment) and T
(temptation), respectively. Using these payoffs, we define a two-person social
dilemma[1]
as any ordering of these payoffs such that mutual cooperation is Pareto optimal
yet may be undermined by the temptation to cheat (if T>R) or by the fear of
being cheated (if P>S) or by both. In the game of “Stag
Hunt” the problem is “fear” but not “greed” (R>T>P>S),
and in the game of “Chicken” the problem is “greed” but not “fear” (T>R>S>P).
The problem is most challenging when both fear and greed are present, that is,
when T>R and P>S. Given the assumption that R>P, there is only one way this can
happen, if T>R>P>S, the
celebrated game of “Prisoner’s Dilemma” (PD).
The Nash equilibrium[2] - the main solution concept in analytical game theory - predicts mutual defection in PD, unilateral defection in Chicken, and either mutual cooperation or mutual defection in Stag Hunt. However, Nash cannot make precise predictions about the selection of supergame equilibria, that is, about the outcome of on-going mixed-motive games. Nor can it tell us much about the dynamics by which a population of players can move from one equilibrium to another. These limitations, along with concerns about the cognitive demands of forward-looking rationality (Dawes and Thaler 1988; Weibull 1998; Fudenberg and Levine 1998), have led game theorists to explore models of cognition that explicitly describe the dynamics of stepwise decision making. This is reflected in a growing number of formal learning-theoretic models of cooperative behavior (Macy 1991; Roth and Erev 1995; Fudenberg and Levine 1998; Peyton-Young 1998; Cohen et. al. 2001). In learning, positive outcomes increase the probability that the associated behavior will be repeated, while negative outcomes reduce it.
In general form, these simple game-theoretic learning models consist of a probabilistic decision rule and a learning algorithm in which game payoffs are evaluated relative to an aspiration level, and the corresponding choice propensities are updated accordingly. The schematic model is diagrammed in Figure 1.

Figure 1. Schematic model of reinforcement learning.
The first step in Figure 1 is the decision by each player whether to cooperate or defect. This decision is probabilistic, based on the player’s current propensity to cooperate. The resulting outcome then generates payoffs (R, S, P, or T) that the players evaluate as satisfactory or unsatisfactory relative to their aspiration level. Satisfactory payoffs present a positive stimulus (or reward) and unsatisfactory payoffs present a negative stimulus (or punishment). These rewards and punishments then modify the probability of repeating the associated choice, such that satisfactory choices become more likely to be repeated, while unsatisfactory choices become less likely.
Erev, Roth and others
(Roth and Erev 1995; Erev and Roth 1998; Erev et al. 1999) used a learning model to estimate globally
applicable parameters from data collected across a variety of human subject
experiments. They concluded that “low rationality” models of reinforcement learning
may often provide a more accurate prediction than orthodox game theoretical
analysis. Macy (1995) and Flache (1996) have also tested a simple reinforcement
learning algorithm in controlled social dilemma experiments and found supporting
evidence.
Learning models have
also been applied outside the laboratory to mixed-motive games played in
everyday life, in which cooperation is largely unthinking and automatic, based
on heuristics, habits, routines, or norms, such as the propensity to loan a
tool to a neighbor, tell the truth, or trouble oneself to vote. For example,
Kanazawa (2000) used voting data to show how a simple learning model suggested
by Macy (1991) solves the “voter paradox” in public choice theory.
While interest in cognitive game theory is clearly growing, recent studies have been divided by disciplinary boundaries between sociology, psychology, and economics that have obstructed the theoretical integration needed to isolate and identify the learning principles that underlie new solution concepts. Our paper aims to move towards such an integration. We align two prominent models in cognitive game theory and integrate them into a general model of adaptive behavior in two-person mixed-motive games (or social dilemmas). In doing so, we will identify a set of model-independent learning principles that are necessary and sufficient to generate cooperative solutions. We begin by briefly summarizing the basic principles of reinforcement learning and then introduce two formal specifications of these principles, the Bush-Mosteller stochastic learning model and the Roth-Erev payoff matching model.
1.1 Learning Theory and the Law of Effect
In
reinforcement learning theory, the search for solutions is “backward-looking”
(Macy 1990) in that it is driven by experience rather than the forward-looking
calculation assumed in analytical game theory (Fudenberg and Levine 1998;
Weibull 1998). Thorndike (1898) formulated this as the “Law of Effect,” based
on the cognitive psychology of William James. If a behavioral response has a
favorable outcome, the neural pathways that triggered the behavior are
strengthened, which “loads the dice in favor of those of its performances which
make for the most permanent interests of the brain’s owner” (James 1981, p.
143). This connectionist theory anticipates the error back-propagation used in
contemporary neural networks (Rummelhart and McLelland 1988). These models show
how highly complex behavioral responses can be acquired through repeated
exposure to a problem. For example, with a little genetic predisposition and a
lot of practice, we can learn to catch a ball while running at full speed,
without having to stop and calculate the trajectory (as the ball passes
overhead).
More
precisely, learning theory relaxes three key behavioral assumptions in
analytical game-theoretic models of decision:
·
Propinquity replaces causality as the link between choices
and payoffs. Learning theory
assumes experiential induction rather than logical deduction. Players explore
the likely consequences of alternative choices and develop preferences for
those associated with better outcomes, even though the association may be
coincident, “superstitious,” or causally spurious.
·
Reward and punishment replace utility as the motivation for
choice. Learning theory differs
from game-theoretic utility theory in positing two distinct cognitive
mechanisms that guide decisions toward better outcomes, approach (driven by reward) and avoidance
(driven by punishment). The distinction means that aspiration levels are
very important for learning theory. The effect of an outcome depends decisively
on whether it is coded as gain or loss, satisfactory or unsatisfactory,
pleasant or aversive.
·
Melioration
replaces optimization as the basis for the distribution of choices over time. Melioration refers to suboptimal gradient climbing when confronted
with “distributed choice” (Herrnstein and Drazin 1991) across recurrent
decisions. Melioration implies a tendency to repeat choices with satisfactory
outcomes even if other choices have higher utility, a behavioral tendency March
and Simon (1958) call “satisficing.” In contrast, unsatisfactory outcomes
induce search for alternative outcomes, including a tendency to revisit
alternative choices whose outcomes are even worse, a pattern we call
“dissatisficing.”
The
Law of Effect does not solve the social dilemma, it merely reframes it: Where the penalty for cooperation is larger
than the reward, and the reward for aggressive behavior is larger than the
penalty, how can penalty-aversive, reward-seeking actors elude the trap of
mutual punishment?
1.2 The Bush-Mosteller Stochastic Learning Model
The earliest answer was given by Rapoport and Chammah (1965), who used learning theory to propose a Markov model of Prisoner’s Dilemma with state transition probabilities given by the payoffs for each state, based on the assumption that each player is satisfied only when the partner cooperates. With choice probabilities updated after each move based on the Law of Effect, mutual cooperation is an absorbing state in the Markov chain.
Macy (1990, 1991)
elaborated Rapoport and Chammah’s analysis using computer simulations of their
Bush-Mosteller stochastic learning model. Macy identified two
learning-theoretic equilibria in the PD game, corresponding to each of the two
learning mechanisms, approach and avoidance. Approach implies a
self-reinforcing equilibrium (SRE)
characterized by satisficing behavior. The SRE obtains when a strategy pair
yields payoffs that are mutually rewarding. The SRE can obtain even when both
players receive less than their optimal payoff (such as the R payoff for mutual cooperation in the
PD game), so long as this payoff exceeds aspirations. Avoidance implies an
aversive self-correcting equilibrium (SCE) characterized by dissatisficing
behavior. Dissatisficing means that both players will try to avoid an outcome
that is better than their worst possible payoff (such as P in the PD game), so long as this payoff is below aspirations. The
SCE obtains when the expected change of probabilities is zero and there is a
positive probability of punishment as well as reward. This happens when
outcomes that reward cooperation or punish defection (causing the probability
of cooperation to increase) balance outcomes that punish cooperation or reward
defection (causing the probability to decline).[3] At equilibrium,
the dynamics pushing the probability higher are balanced by the dynamics
pushing in the other direction, like a tug-of-war between two equally strong
teams.
These learning
theoretic equilibria differ fundamentally from the Nash predictions in that
agents have an incentive to unilaterally change strategy. In the SRE, both
players have an incentive to deviate from unconditional mutual cooperation, but
they stay the course so long as the R
payoff exceeds their aspirations. Unlike the Nash equilibrium, the SCE has
players constantly changing course, but their efforts are self-defeating. It is
not that everyone decides they are doing the best they can, it is that their
efforts to do better set into motion a dynamic that pulls the rug out from
under everyone.
Suppose two players in
PD are each satisfied only when the partner cooperates, and each starts out
with zero probability of cooperation. They are both certain to defect, which
then causes both probabilities to increase (as an avoidance response).
Paradoxically, an increased probability of cooperation now makes a unilateral
outcome (CD or DC) more likely
than before, and these outcomes punish the cooperator and reward the defector,
causing both probabilities to drop. Nevertheless, there is always the chance
that both players will defect anyway, causing probabilities to rise further
still. Once both probabilities exceed 0.5, further increases become
self-reinforcing, by increasing the chances for another bilateral move, and
this move is now more likely to be mutual cooperation instead of mutual
defection. In short, the players can escape the social trap through stochastic collusion, characterized by a
chance sequence of bilateral moves in a “drunkard’s walk.” A fortuitous string
of bilateral outcomes can increase cooperative probabilities to the point that
cooperation becomes self-sustaining - the drunkard wanders out of the gully.
1.3 The Roth-Erev Payoff-Matching Model
More
recently, Roth and Erev (Roth and Erev 1995; Erev and Roth 1998; Erev et al.
1999) have proposed a learning-theoretic alternative to the earlier
Bush-Mosteller formulation. Their model draws on the “matching law” which holds
that adaptive actors will choose between alternatives in a ratio that matches
the ratio of reward. Applied to social dilemmas, the matching law predicts that
players will learn to cooperate to the extent that the payoff for cooperation
exceeds that for defection, which is possible only if both players happen to
cooperate and defect at the same time (given R>P). As with
Bush-Mosteller, the path to cooperation is a sequence of bilateral moves.
Like the Bush-Mosteller
stochastic learning model, the Roth-Erev payoff matching model implements the three
basic principles that distinguish learning from utility theory – experiential
induction (vs. logical deduction), reward and punishment (vs. utility), and
melioration (vs. optimization). The similarity in substantive assumptions makes
it tempting to assume that the two models are mathematically equivalent, or if
not, that they nevertheless give equivalent solutions.
On closer inspection,
however, we find important differences. Each specification implements
reinforcement learning in different ways, and with different results. Without a
systematic theoretical alignment and integration of the two algorithms, it is
not clear whether and under what conditions the backward-looking solution for
social dilemmas identified with the Bush-Mosteller (BM) specification generalizes
to the Roth-Erev (RE) model.
Those assumptions can
be brought to the surface by close comparison of competing models and by
integrating alternative specifications as special cases of a more general
model. This is especially important for models that must rely on computational
rather than mathematical methods. “Without such a process of close comparison,
computational modeling will never provide the clear sense of ‘domain of
validity’ that typically can be obtained for mathematized theories” (Axtell et
al. 1996, p. 123).
Unfortunately, learning
models have been divided by disciplinary boundaries between sociology,
psychology, and economics that have obstructed the theoretical integration
needed to isolate and identify the learning principles that underlie new
solution concepts. Accordingly, this paper aims to align two prominent models
in cognitive game theory and to integrate them into a general model of adaptive
behavior in two-person mixed-motive games. By “docking” (Axtell et al. 1996)
the BM stochastic learning model with the RE payoff-matching model, we can
identify learning principles that generalize beyond particular specifications,
and at the same time, uncover hidden assumptions that explain differences in
the outcomes and constrain the generality of the theoretical derivations.
In section 2 that
follows, we formally align the two learning models and integrate them into a
General Reinforcement Learning (GRL) model, for which the Bush-Mosteller and
Roth-Erev models are special cases. This process brings to the surface a key
hidden assumption with important implications for the generality of
learning-theoretic solution concepts. Section 3 then uses computer simulations
of the integrated model to explore the effects of model differences across a
range of two-person social dilemma games. The analyses confirm the generality
of stochastic collusion but also show that its determinants depend decisively
on whether BM or RE assumptions about learning dynamics are employed.
In general form,
both the BM and RE models implement the stochastic decision process diagrammed
in Figure 1, in which choice propensities are updated by the associated
outcomes. Thus, both
models imply the existence of some aspiration level relative to which cardinal
payoffs can be positively or negatively evaluated. Formally, the stimulus s associated with action a is calculated as
[1]
where
pa is the payoff
associated with action a (R or S
if a = C, and T or P if a = D) and sa is a positive or negative stimulus derived from pa. The denominator in [1] represents the upper value of the
set of possible differences between payoff and aspiration. With this scaling
factor, stimulus s is always equal to
or less than unity in absolute value, regardless of the magnitude of the corresponding
payoff.[4]
Neither model
imposes constraints on the determinants of aspirations.
Whether aspirations are high or low or
habituate with experience depends on assumptions that are exogenous to both
models.
2.1 Aligning the two models
The
two models diverge at the point that these evaluations are used to update
choice probabilities. The Bush-Mosteller stochastic learning algorithm updates
probabilities following an action a
(cooperation or defection) as follows:
[2]
In
equation [2], pa,t is the
probability of action a at time t and
is the positive
or negative stimulus given by [1]. The change in the probability for the action
not taken, b, obtains from the
constraint that probabilities always sum to one, i.e.
. The parameter l
is a constant (0 < l < 1) that
scales the learning rate. With l » 0, learning is
very slow, and with l » 1,
the model approximates a “win-stay, lose-shift” strategy (Catania 1992).
For any value of l, Equation 2 implies a decreasing
effect of reward as the associated propensity approaches unity, but an
increasing effect of punishment. Similarly, as the propensity approaches zero,
there is a decreasing effect of punishment and a growing effect of reward. This
constrains probabilities to approach asymptotically their natural limits.
Like the
Bush-Mosteller, the Roth-Erev model is stochastic, but the probabilities are
not equivalent to propensities. Propensities are a function of cumulative
satisfication and dissatisfaction with the associated choices, and
probabilities are a function of the ratio of propensities. More precisely, the
propensity q for action a at time T is the sum of all stimuli
a player has ever
received when playing a:
[3]
Roth and Erev then use a “probabilistic choice rule” to
translate propensities into behavior. The probability pa of action a
at time t+1 is the propensity for a divided by the sum of the propensities
at time t:
[4]
where
a and b represent the binary choices to cooperate or defect. Following
action a, the associated propensity qa increases if the payoff is
positive relative to aspirations (by increasing the numerator in [4]) and
decreases if negative. The propensity for b
remains constant, but the probability of b
declines (by increasing the denominator in the equivalent expression for pb,t+1).
An obvious problem with
the specification of equations [3] and [4] (but not [2]) is the possibility of
negative probabilities if punishment dominates reinforcement. In their original
model, Erev, Roth and Bereby-Meyer (1999) circumvent this problem with the ad hoc addition of a “clipping” rule.
More precisely, their implementation assumes that the new propensity qa,t+1 = qa,t + sa,t except for
the case where this sum drops below a very small positive constant n. In that case, the
new propensity is reset to n. Equations [5] and [6] represent the RE
clipping rule by rewriting [3] as
[5]
where
a is the action taken in round t (either cooperation or defection) and r is a response function of the form
[6]
where
all terms except n are indexed on t.
This solution causes the response to reinforcement to become discontinuous as
propensities approach the lower bound n. In order to align the model with
Bush-Mosteller, we replaced the discontinuous function in [6] with a more
elegant solution. Equation [7] gives asymptotic lower as well as upper limits
to the probabilites:
[7]