A more general model of cooperation based on reinforcement

learning: Alignment and Integration of the Bush-Mosteller and the

Roth-Erev model*

 

Paper to be presented at the Model To Model (M2M) Workshop, March 31st -  April 1st, Marseille, France.

 

 

 

Andreas Flache

Department of Sociology, University of Groningen

Grote Rozenstraat 31. 9712 TG  Groningen, The Netherlands

a.flache@ppsw.rug.nl

 

Michael W. Macy

Department of Sociology, Cornell University

Ithaca N.Y. 14853

mwm14@cornell.edu

 


Stochastic Collusion and the Power Law of Learning:

Aligning and Integrating the Bush-Mosteller and the Roth-Erev Reinforcement Learning Models of Cooperation

Abstract

Analytical game theory has developed the Nash equilibrium as theoretical tool for the analysis of cooperation and conflicts in interdependent decision making. Indeterminacy and demanding rationality assumptions of the Nash equilibrium have led cognitive game theorists to explore learning-theoretic models of behavior. Two prominent examples are the Bush-Mosteller stochastic learning model and the Roth-Erev payoff-matching model. We align and integrate the two models as special cases of a General Reinforcement Learning Model. Both models predict stochastic collusion as a backward-looking solution to the problem of cooperation in social dilemmas, based on a random walk into a self-reinforcing cooperative equilibrium. The integration also uncovers hidden assumptions that constrain the generality of the theoretical derivations. Specifically, Roth and Erev assume a “Power Law of Learning” - the curious but plausible tendency for learning to diminish with success and intensify with failure, which we call “fixation.” We use computer simulation to explore the effects of fixation on stochastic collusion in three social dilemma games. The analysis shows how the integration of alternative models can uncover underlying principles and lead to a more general theory.


1. Introduction: Towards A Cognitive Game Theory

Conflicts as diverse as those between neighbors, friends, business associates, ethnic groups, or states, often ensue from the same fundamental logic. On the one hand the parties involved have common interests, such as avoidance of an all out nuclear war, on the other hand they may fall prey to the temptation to pursue individual agendas at the expense of the commons. To explore how and under what conditions such relationships may proceed cooperatively rather than conflictuous, game theory formalizes the problem of cooperation. At the most elementary level, the theory represents relationships as a mixed-motive two-person game with two choices, cooperate and defect. These choices intersect at four possible outcomes, abbreviated as CC, CD, DD, and DC. Each outcome has an associated payoff: R (reward), S (sucker), P (punishment) and T (temptation), respectively. Using these payoffs, we define a two-person social dilemma[1] as any ordering of these payoffs such that mutual cooperation is Pareto optimal yet may be undermined by the temptation to cheat (if T>R) or by the fear of being cheated (if P>S) or by both. In the game of “Stag Hunt” the problem is “fear” but not “greed” (R>T>P>S), and in the game of “Chicken” the problem is “greed” but not “fear” (T>R>S>P). The problem is most challenging when both fear and greed are present, that is, when T>R and P>S. Given the assumption that R>P, there is only one way this can happen, if T>R>P>S, the celebrated game of “Prisoner’s Dilemma” (PD).

The Nash equilibrium[2] - the main solution concept in analytical game theory - predicts mutual defection in PD, unilateral defection in Chicken, and either mutual cooperation or mutual defection in Stag Hunt. However, Nash cannot make precise predictions about the selection of supergame equilibria, that is, about the outcome of on-going mixed-motive games. Nor can it tell us much about the dynamics by which a population of players can move from one equilibrium to another. These limitations, along with concerns about the cognitive demands of forward-looking rationality (Dawes and Thaler 1988; Weibull 1998; Fudenberg and Levine 1998), have led game theorists to explore models of cognition that explicitly describe the dynamics of stepwise decision making.  This is reflected in a growing number of formal learning-theoretic models of cooperative behavior (Macy 1991; Roth and Erev 1995; Fudenberg and Levine 1998; Peyton-Young 1998; Cohen et. al. 2001). In learning, positive outcomes increase the probability that the associated behavior will be repeated, while negative outcomes reduce it.

In general form, these simple game-theoretic learning models consist of a probabilistic decision rule and a learning algorithm in which game payoffs are evaluated relative to an aspiration level, and the corresponding choice propensities are updated accordingly. The schematic model is diagrammed in Figure 1.

 

 

 

 

 


Figure 1. Schematic model of reinforcement learning.

The first step in Figure 1 is the decision by each player whether to cooperate or defect. This decision is probabilistic, based on the player’s current propensity to cooperate. The resulting outcome then generates payoffs (R, S, P, or T) that the players evaluate as satisfactory or unsatisfactory relative to their aspiration level. Satisfactory payoffs present a positive stimulus (or reward) and unsatisfactory payoffs present a negative stimulus (or punishment). These rewards and punishments then modify the probability of repeating the associated choice, such that satisfactory choices become more likely to be repeated, while unsatisfactory choices become less likely.

Erev, Roth and others (Roth and Erev 1995; Erev and Roth 1998; Erev et al. 1999) used a learning model to estimate globally applicable parameters from data collected across a variety of human subject experiments. They concluded that “low rationality” models of reinforcement learning may often provide a more accurate prediction than orthodox game theoretical analysis. Macy (1995) and Flache (1996) have also tested a simple reinforcement learning algorithm in controlled social dilemma experiments and found supporting evidence.

Learning models have also been applied outside the laboratory to mixed-motive games played in everyday life, in which cooperation is largely unthinking and automatic, based on heuristics, habits, routines, or norms, such as the propensity to loan a tool to a neighbor, tell the truth, or trouble oneself to vote. For example, Kanazawa (2000) used voting data to show how a simple learning model suggested by Macy (1991) solves the “voter paradox” in public choice theory.

While interest in cognitive game theory is clearly growing, recent studies have been divided by disciplinary boundaries between sociology, psychology, and economics that have obstructed the theoretical integration needed to isolate and identify the learning principles that underlie new solution concepts. Our paper aims to move towards such an integration. We align two prominent models in cognitive game theory and integrate them into a general model of adaptive behavior in two-person mixed-motive games (or social dilemmas). In doing so, we will identify a set of model-independent learning principles that are necessary and sufficient to generate cooperative solutions. We begin by briefly summarizing the basic principles of reinforcement learning and then introduce two formal specifications of these principles, the Bush-Mosteller stochastic learning model and the Roth-Erev payoff matching model.

1.1 Learning Theory and the Law of Effect

In reinforcement learning theory, the search for solutions is “backward-looking” (Macy 1990) in that it is driven by experience rather than the forward-looking calculation assumed in analytical game theory (Fudenberg and Levine 1998; Weibull 1998). Thorndike (1898) formulated this as the “Law of Effect,” based on the cognitive psychology of William James. If a behavioral response has a favorable outcome, the neural pathways that triggered the behavior are strengthened, which “loads the dice in favor of those of its performances which make for the most permanent interests of the brain’s owner” (James 1981, p. 143). This connectionist theory anticipates the error back-propagation used in contemporary neural networks (Rummelhart and McLelland 1988). These models show how highly complex behavioral responses can be acquired through repeated exposure to a problem. For example, with a little genetic predisposition and a lot of practice, we can learn to catch a ball while running at full speed, without having to stop and calculate the trajectory (as the ball passes overhead).

More precisely, learning theory relaxes three key behavioral assumptions in analytical game-theoretic models of decision:

·         Propinquity replaces causality as the link between choices and payoffs. Learning theory assumes experiential induction rather than logical deduction. Players explore the likely consequences of alternative choices and develop preferences for those associated with better outcomes, even though the association may be coincident, “superstitious,” or causally spurious.

·         Reward and punishment replace utility as the motivation for choice. Learning theory differs from game-theoretic utility theory in positing two distinct cognitive mechanisms that guide decisions toward better outcomes, approach (driven by reward) and avoidance (driven by punishment). The distinction means that aspiration levels are very important for learning theory. The effect of an outcome depends decisively on whether it is coded as gain or loss, satisfactory or unsatisfactory, pleasant or aversive.

·         Melioration replaces optimization as the basis for the distribution of choices over time. Melioration refers to suboptimal gradient climbing when confronted with “distributed choice” (Herrnstein and Drazin 1991) across recurrent decisions. Melioration implies a tendency to repeat choices with satisfactory outcomes even if other choices have higher utility, a behavioral tendency March and Simon (1958) call “satisficing.” In contrast, unsatisfactory outcomes induce search for alternative outcomes, including a tendency to revisit alternative choices whose outcomes are even worse, a pattern we call “dissatisficing.”

The Law of Effect does not solve the social dilemma, it merely reframes it: Where the penalty for cooperation is larger than the reward, and the reward for aggressive behavior is larger than the penalty, how can penalty-aversive, reward-seeking actors elude the trap of mutual punishment?

1.2 The Bush-Mosteller Stochastic Learning Model

The earliest answer was given by Rapoport and Chammah (1965), who used learning theory to propose a Markov model of Prisoner’s Dilemma with state transition probabilities given by the payoffs for each state, based on the assumption that each player is satisfied only when the partner cooperates. With choice probabilities updated after each move based on the Law of Effect, mutual cooperation is an absorbing state in the Markov chain.

Macy (1990, 1991) elaborated Rapoport and Chammah’s analysis using computer simulations of their Bush-Mosteller stochastic learning model. Macy identified two learning-theoretic equilibria in the PD game, corresponding to each of the two learning mechanisms, approach and avoidance. Approach implies a self-reinforcing equilibrium (SRE) characterized by satisficing behavior. The SRE obtains when a strategy pair yields payoffs that are mutually rewarding. The SRE can obtain even when both players receive less than their optimal payoff (such as the R payoff for mutual cooperation in the PD game), so long as this payoff exceeds aspirations. Avoidance implies an aversive self-correcting equilibrium (SCE) characterized by dissatisficing behavior. Dissatisficing means that both players will try to avoid an outcome that is better than their worst possible payoff (such as P in the PD game), so long as this payoff is below aspirations. The SCE obtains when the expected change of probabilities is zero and there is a positive probability of punishment as well as reward. This happens when outcomes that reward cooperation or punish defection (causing the probability of cooperation to increase) balance outcomes that punish cooperation or reward defection (causing the probability to decline).[3] At equilibrium, the dynamics pushing the probability higher are balanced by the dynamics pushing in the other direction, like a tug-of-war between two equally strong teams.

These learning theoretic equilibria differ fundamentally from the Nash predictions in that agents have an incentive to unilaterally change strategy. In the SRE, both players have an incentive to deviate from unconditional mutual cooperation, but they stay the course so long as the R payoff exceeds their aspirations. Unlike the Nash equilibrium, the SCE has players constantly changing course, but their efforts are self-defeating. It is not that everyone decides they are doing the best they can, it is that their efforts to do better set into motion a dynamic that pulls the rug out from under everyone.

Suppose two players in PD are each satisfied only when the partner cooperates, and each starts out with zero probability of cooperation. They are both certain to defect, which then causes both probabilities to increase (as an avoidance response). Paradoxically, an increased probability of cooperation now makes a unilateral outcome (CD or DC) more likely than before, and these outcomes punish the cooperator and reward the defector, causing both probabilities to drop. Nevertheless, there is always the chance that both players will defect anyway, causing probabilities to rise further still. Once both probabilities exceed 0.5, further increases become self-reinforcing, by increasing the chances for another bilateral move, and this move is now more likely to be mutual cooperation instead of mutual defection. In short, the players can escape the social trap through stochastic collusion, characterized by a chance sequence of bilateral moves in a “drunkard’s walk.” A fortuitous string of bilateral outcomes can increase cooperative probabilities to the point that cooperation becomes self-sustaining - the drunkard wanders out of the gully.

1.3 The Roth-Erev Payoff-Matching Model

More recently, Roth and Erev (Roth and Erev 1995; Erev and Roth 1998; Erev et al. 1999) have proposed a learning-theoretic alternative to the earlier Bush-Mosteller formulation. Their model draws on the “matching law” which holds that adaptive actors will choose between alternatives in a ratio that matches the ratio of reward. Applied to social dilemmas, the matching law predicts that players will learn to cooperate to the extent that the payoff for cooperation exceeds that for defection, which is possible only if both players happen to cooperate and defect at the same time (given R>P). As with Bush-Mosteller, the path to cooperation is a sequence of bilateral moves.

Like the Bush-Mosteller stochastic learning model, the Roth-Erev payoff matching model implements the three basic principles that distinguish learning from utility theory – experiential induction (vs. logical deduction), reward and punishment (vs. utility), and melioration (vs. optimization). The similarity in substantive assumptions makes it tempting to assume that the two models are mathematically equivalent, or if not, that they nevertheless give equivalent solutions.

On closer inspection, however, we find important differences. Each specification implements reinforcement learning in different ways, and with different results. Without a systematic theoretical alignment and integration of the two algorithms, it is not clear whether and under what conditions the backward-looking solution for social dilemmas identified with the Bush-Mosteller (BM) specification generalizes to the Roth-Erev (RE) model.

Those assumptions can be brought to the surface by close comparison of competing models and by integrating alternative specifications as special cases of a more general model. This is especially important for models that must rely on computational rather than mathematical methods. “Without such a process of close comparison, computational modeling will never provide the clear sense of ‘domain of validity’ that typically can be obtained for mathematized theories” (Axtell et al. 1996, p. 123).

Unfortunately, learning models have been divided by disciplinary boundaries between sociology, psychology, and economics that have obstructed the theoretical integration needed to isolate and identify the learning principles that underlie new solution concepts. Accordingly, this paper aims to align two prominent models in cognitive game theory and to integrate them into a general model of adaptive behavior in two-person mixed-motive games. By “docking” (Axtell et al. 1996) the BM stochastic learning model with the RE payoff-matching model, we can identify learning principles that generalize beyond particular specifications, and at the same time, uncover hidden assumptions that explain differences in the outcomes and constrain the generality of the theoretical derivations.

In section 2 that follows, we formally align the two learning models and integrate them into a General Reinforcement Learning (GRL) model, for which the Bush-Mosteller and Roth-Erev models are special cases. This process brings to the surface a key hidden assumption with important implications for the generality of learning-theoretic solution concepts. Section 3 then uses computer simulations of the integrated model to explore the effects of model differences across a range of two-person social dilemma games. The analyses confirm the generality of stochastic collusion but also show that its determinants depend decisively on whether BM or RE assumptions about learning dynamics are employed.

 

2. The Bush-Mosteller and Roth-Erev Specifications

In general form, both the BM and RE models implement the stochastic decision process diagrammed in Figure 1, in which choice propensities are updated by the associated outcomes. Thus, both models imply the existence of some aspiration level relative to which cardinal payoffs can be positively or negatively evaluated. Formally, the stimulus s associated with action a is calculated as

                       [1]

where pa is the payoff associated with action a (R or S if = C, and T or P if = D) and sa is a positive or negative stimulus derived from pa. The denominator in [1] represents the upper value of the set of possible differences between payoff and aspiration. With this scaling factor, stimulus s is always equal to or less than unity in absolute value, regardless of the magnitude of the corresponding payoff.[4]

Neither model imposes constraints on the determinants of aspirations. Whether aspirations are high or low or habituate with experience depends on assumptions that are exogenous to both models.

2.1 Aligning the two models

The two models diverge at the point that these evaluations are used to update choice probabilities. The Bush-Mosteller stochastic learning algorithm updates probabilities following an action a (cooperation or defection) as follows:

                                        [2]

In equation [2], pa,t is the probability of action a at time t and  is the positive or negative stimulus given by [1]. The change in the probability for the action not taken, b, obtains from the constraint that probabilities always sum to one, i.e. . The parameter l is a constant (0 < l < 1) that scales the learning rate. With l » 0, learning is very slow, and with » 1, the model approximates a “win-stay, lose-shift” strategy (Catania 1992).

For any value of l, Equation 2 implies a decreasing effect of reward as the associated propensity approaches unity, but an increasing effect of punishment. Similarly, as the propensity approaches zero, there is a decreasing effect of punishment and a growing effect of reward. This constrains probabilities to approach asymptotically their natural limits.

Like the Bush-Mosteller, the Roth-Erev model is stochastic, but the probabilities are not equivalent to propensities. Propensities are a function of cumulative satisfication and dissatisfaction with the associated choices, and probabilities are a function of the ratio of propensities. More precisely, the propensity q for action a at time T is the sum of all stimuli  a player has ever received when playing a:

                                                                                         [3]

Roth and Erev then use a “probabilistic choice rule” to translate propensities into behavior. The probability pa of action a at time t+1 is the propensity for a divided by the sum of the propensities at time t:

                                                               [4]

 

where a and b represent the binary choices to cooperate or defect. Following action a, the associated propensity qa increases if the payoff is positive relative to aspirations (by increasing the numerator in [4]) and decreases if negative. The propensity for b remains constant, but the probability of b declines (by increasing the denominator in the equivalent expression for pb,t+1).

An obvious problem with the specification of equations [3] and [4] (but not [2]) is the possibility of negative probabilities if punishment dominates reinforcement. In their original model, Erev, Roth and Bereby-Meyer (1999) circumvent this problem with the ad hoc addition of a “clipping” rule. More precisely, their implementation assumes that the new propensity qa,t+1 = qa,t + sa,t  except for the case where this sum drops below a very small positive constant n. In that case, the new propensity is reset to n. Equations [5] and [6] represent the RE clipping rule by rewriting [3] as

                                                                               [5]

 

where a is the action taken in round t (either cooperation or defection) and r is a response function of the form

                                                           [6]

 

where all terms except n are indexed on t. This solution causes the response to reinforcement to become discontinuous as propensities approach the lower bound n. In order to align the model with Bush-Mosteller, we replaced the discontinuous function in [6] with a more elegant solution. Equation [7] gives asymptotic lower as well as upper limits to the probabilites:

                                                 [7]