Replication, Replication and Replication
– Some Hard Lessons from Model Alignment
Bruce Edmonds (bruce@cfpm.org) & David Hales
(dave@davidhales.com)
Centre for Policy Modelling (http://cfpm.org)
Nov. 2002
A published simulation model (Riolo et al. 2001) was replicated in two independent implementations so that the results as well as the conceptual design align. This double replication allowed the original to be analysed and critiqued with confidence. In this case, the replication revealed some weaknesses in the original model, which otherwise might not have come to light. This shows that unreplicated simulation models and their results can not be trusted – as with other kinds of experiment, simulations need to be independently replicated.
Typically in social simulation, the ultimate modelling target is a social process. The purpose of the simulation is to somehow help us understand that process. The ability of the simulation to aid us in this way is dependent on there being some relation between the simulation and the social process. Frequently this relation is indirect - the simulation is a model[1] of an abstract process, which is related to the social phenomena in a rich analogical way in the minds and writing of the researchers. This conceptual model mediates between the simulation and the phenomena, the purpose of the simulation is to inform the conceptual model, but it is only the conceptual model which directly represents the phenomena[2].
Sometimes modellers attempt to go beyond this indirect relationship between simulation and phenomena to establish a more direct mapping. In these cases the results of the simulation are compared to data models of the phenomena. There are two ways of doing this as a post hoc check on its validity or to calibrate the model – i.e. the simulation is often adjusted until it is, to some degree, in agreement with the data. In either case, the data gained from the phenomena is used to constrain the simulation. In social simulation the data is never sufficient to constrain the model down to uniqueness, rather some aspects of the model are usually partially constrained using data, leaving other aspects to be determined in other ways. These “other ways” can include: expert/stakeholder opinion, prior theory, pragmatic considerations (what is sometimes called “simplicity”), and conceptual models. Many simulations are partially constrained by data in this way and partially determined conceptually.
Unlike attempts to relate a model to social phenomena, (up to this point) simulation models have been usually related to other models purely via intermediate conceptual models, which are usually expressed in natural language. This paper explores the extent to which simulation models can be related to other simulation models so that their results, as well as their intended design, are consistent with each other. That is, the extent to which simulation models can be faithfully replicated in different implementations so they give the same results.
In this paper the authors describe their experience of such replication. This involved replicating a simple published model in two different implementations. This experience has indicated some techniques that seem to be helpful in this process.
The overall conclusion of this paper is that aligning models is very difficult, but very revealing. The process revealed a host of minor bugs and ill-defined implementation issues in simulations that otherwise appeared to be working well and according to their specification. Clearly, simply implementing simulations with respect to a conceptual model and then “eyeballing” their outputs for consistency with the conceptual model and data series is insufficient to ensure the correctness of an implementation. This indicates that, almost certainly, the vast majority of published social simulations do not comply with their authors’ intentions. In some of these cases the differences between intentions and simulations may be minor, in the sense that they do not change the “overall” character of the simulation results (the so-called “statistical signature”). In the others, these differences may be important, so that when run with new parameters one would get results inconsistent with the stated conceptual model or analysis.
This problem is well known in computer science – it is the problem of verification. Formal and/or structured implementation techniques have been developed to aid programmers implement a specification correctly and to check whether the implementation is correct. These techniques may also be usefully applied in social simulation – (VUA/DESIRE hierarchical verification example) uses a hierarchical verification on a very simple MAS. However the complexity of most social simulations means that these techniques will be of limited use in this regard, because the phenomena that social simulations focus on are usually precisely those where new properties emerge (Edmonds 1998).
Traditional computer science methods often assume that we know a prior what the output of a program should be – that is, there are some set of functional requirements that the program should satisfy. Due to the exploratory nature of simulation work, and the focus on “emergent properties”, functional requirements often do not exist. The researcher “discovers” what happens when the simulation is run. Given this, it is often the case that the programmer of the simulation model literally does not know what to expect and therefore cannot easily check (from the output of a simulation run) whether the program is conforming to the specification (even if a very precise specification existed – which it often does not!). See David et al. (2002) for a discussion of this issue.
However, this does not mean that it is impossible to check the correctness of implemented social simulations, since simulations can be independently replicated from the conceptual model and their results checked against each other. By independently implementing and executing a simulation from a single model specification and subsequent alignment of those simulations “hidden parameters” can be uncovered and aligned. If the specification does not include important parameters or mechanisms (since their importance may have escaped the original designer) then this is likely to be revealed by inconsistencies in outputs between the two implementations.
On the whole a published specification and results will focus on specific phenomena (or story). This means the number of results given will be small and rarely cover much of the parameter space. The danger in a single re-implementation is therefore to assume that the models are aligned when the (small set of) given published results are sufficiently close to those produced by the re-implementation. However, to have a high degree of confidence in the similarity of the two implementations would require matching results with runs based on different parameters (of the model) from those used during any previous alignment process. This is similar to the requirement for testing and comparing results in Machine Learning (Mitchell 1997) - that one should not assess the goodness of some induction algorithm based on previously “seen” data (even if the data was only “seen” by the programmer of the algorithm[3]).
If two independent re-implementations are carried out then this problem is alleviated since when both implementations produce results sufficiently close to those given in the published account then the two re-implementations can be compared to each other over a large part of the parameter space (by executing runs based on extreme or arbitrary parameters). If the models are implementing the same underlying process then they should produce results that are sufficiently close with new previously unexplored parameter values. If new parameter values result in non-alignment of the results then either 1) the specification is inadequate and needs to be refined or 2) at least one of the implementations is an erroneous implementation of the specification[4]. Both programmers stating a refined specification (based on their implementation) and cross checking can attack the first problem. Traditional debugging techniques can be applied to the second problem. It would seem logical to start with the former and move to the latter.
We re-implemented a model first described by Riolo et al (2001). This model explores how “tags” (observable social cues or markings) can produce co-operation between seemingly[5] self-interested agents. It follows previous models and suggestions (e.g. Holland 1993, Hales 2000)[6].
The model consists of a population of 100 evolving agents. Each agent has two (real valued) traits: a tag τ (where 0 £ τ £ 1) and a tolerance threshold T (where 0 £ T £ 1). Initially the tags and thresholds are allocated uniformly randomly. To start with each agent is given a randomly selected tag value and tolerance value from these ranges.
The simulation is executed for a number of “generations”. In each generation each agent is paired with another agent P times. For each pairing a new agent is randomly select from the population. The randomly selected agent is denoted the “recipient”, the other agent the “donor”. When a pairing occurs the donor decides whether to make a donation to the recipient. A donation is made if the recipients tag is sufficiently similar to the donors tag.
A recipient tag is considered to be sufficiently similar if it is within the tolerance of the donating agent. Specifically, given a potential donor agent D and a potential recipient R a donation will only be made when |τD–τR| £ TD. This means that an agent with a high T value may donate to agents over a large range of tag values. A low value for T restricts donation to agents with very similar tag values to the donor. In all cases donation can only occur when the skill type of the receiving agent matches the skill type associated with the resource. If a donation is made the donating agent incurs a cost, c, and the recipient gains a benefit, b. In all experiments given here, the benefit b = 1 but the cost c is varied as is the number of pairings.
After all agents have been paired P times and made any possible donations the entire population is reproduced. Reproduction is accomplished in the following manner – each agent is selected from the population in turn, its score is compared to another randomly chosen agent, and the one with the highest score is reproduced into the next generation (a form of “tournament selection”). The scores are not carried over into the next generation but calculated afresh each time. Mutation is applied to each trait of each offspring. With probability = 0.1 the offspring receives a new tag (uniformly randomly selected). With the same probability, Gaussian noise is added to the tolerance value (mean 0, standard deviation 0.01). When T < 0 or T > 1, it is reset to 0 and 1 respectively.
Table 1 and Table 2 below show the results given in Riolo et al (2001). Each of these tables shows the average values of the donation rate and tolerance over all the individuals over 30,000 generations and 30 independent runs.
|
Effect of pairings on donation rate |
||
|
Parings |
Donation rate (%) |
Average tolerance |
|
1 |
2.1 |
0.009 |
|
2 |
4.3 |
0.007 |
|
3 |
73.6 |
0.019 |
|
4 |
76.8 |
0.021 |
|
6 |
78.7 |
0.024 |
|
8 |
79.2 |
0.025 |
|
10 |
79.2 |
0.024 |
Table 1. Pairings is the number of times per generation each agent has an opportunity to donate to a randomly encountered other. The donation rate is the percentage of such encounters in which the choosing agent cooperates, that is, donates b = 1.0 at a cost of c = 0.1 to itself. The average tolerance is the average over all agents and all generations.
|
Effect of cost of donating on donation rate |
||
|
Cost |
Donation rate (%) |
Average tolerance |
|
0.05 |
73.7 |
0.019 |
|
0.1 |
73.6 |
0.019 |
|
0.2 |
73.6 |
0.018 |
|
0.3 |
73.5 |
0.018 |
|
0.4 |
60.1 |
0.011 |
|
0.5 |
24.7 |
0.007 |
|
0.6 |
2.2 |
0.005 |
Table 2. The number of pairings is held at P = 3. The recipient benefit his held at b = 1 and the cost to the donor is varied as shown.
These results showed that in this simulation with 100 agents, if one had at least 3 pairings per individual each generation, then you would get a high rate of donation (>60% of all pairings). Similarly this effect holds for cost of donation being as high as 0.4. This indicated that in this model that fairly robust “altruistic” donation was sustainable in this simulation model.
The interpretation that the Riolo et al. paper gave was that the tolerance played a role similar to the defector/co-operator in the Prisoner’s Dilemma problem. A very low tolerance indicated an individual who would not donate to many others (i.e. a defector) whilst an individual with a high tolerance would donate to others (i.e. a co-operator). They summarised their results as follows:
“Strategies of donating to others who have sufficiently similar heritable tags … can establish cooperation without reciprocity”. (Riolo et al. 2001, page 443)
Below are the results from the first re-implementation of the Riolo model (in Java by David Hales). They are also over 30,000 time periods and 30 runs.
|
Effect of pairings on donation rate |
||
|
Parings |
Donation rate (%) |
Average tolerance |
|
1 |
5.1 (3.0) |
0.010 (0.1) |
|
2 |
42.6 (38.3) |
0.012 (0.5) |
|
3 |
73.7 (0.1) |
0.018 (0.1) |
|
4 |
76.8 (0.0) |
0.021 (0.0) |
|
6 |
78.6 (0.1) |
0.023 (0.1) |
|
8 |
79.2 (0.0) |
0.025 (0.0) |
|
10 |
79.4 (0.2) |
0.026 (0.2) |
Table 3. Pairings is the number of times per generation each agent has an opportunity to donate to a randomly encountered other. The donation rate is the percentage of such encounters in which the choosing agent cooperates, that is, donates b = 1.0 at a cost of c = 0.1 to itself. The average tolerance is the average over all agents and all generations. The values in brackets show the difference between these results and the original results given in table 1.
|
Effect of cost of donating on donation rate |
||
|
Cost |
Donation rate (%) |
Average tolerance |
|
0.05 |
73.7 (0.0) |
0.018 (0.001) |
|
0.1 |
73.7 (0.1) |
0.018 (0.001) |
|
0.2 |
73.7 (0.1) |
0.019 (0.001) |
|
0.3 |
73.7 (0.2) |
0.018 (0.000) |
|
0.4 |
61.0 (0.9) |
0.011 (0.000) |
|
0.5 |
45.9 (21.2) |
0.008 (0.001) |
|
0.6 |
8.1 (5.9) |
0.006 (0.001) |
Table 4. The number of pairings is held at P = 3. The recipient benefit his held at b = 1 and the cost to the donor is varied as shown. The values in brackets show the difference between these results and the original results given in table 2.
The results broadly agreed with those of the Riolo model, but the thresholds at which the high donation effect disappeared had shifted. In these results one was getting significant donation rates for only 2 pairings and for costs up to 0.5. However it was at these thresholds that the re-implemented runs showed the greatest variance. With only a limited description and data in the original paper it is difficult to determine where the error was. For this reason the other author re-implemented the model in a different language (SDML[7]), to see if the Riolo et al. results could be implemented.
The results were only generated for 1 run up to 5000 generations and for up to 6 pairings (so confidence over results is lower than for the averages over 30 runs to 30,0000 generation so far) but they strongly indicate that the results match – although here we merely “eyeball” the results and intuit their similarity.
|
Effect of pairings on donation rate |
||
|
Parings |
Donation rate (%) |
Average tolerance |
|
1 |
5.2 (0.1) |
0.011 (0.1) |
|
2 |
42.6 (0.0) |
0.012 (0.0) |
|
3 |
73.3 (0.4) |
0.018 (0.0) |
|
4 |
76.6 (0.2) |
0.022 (0.1) |
|
6 |
78.6 (0.0) |
0.023 (0.0) |
Table 5. Pairings is the number of times per generation each agent has an opportunity to donate to a randomly encountered other. The donation rate is the percentage of such encounters in which the choosing agent cooperates, that is, donates b = 1.0 at a cost of c = 0.1 to itself. The average tolerance is the average over all agents and all generations. The values in brackets show the difference between these results and the initial re-implementation results given in table 3.
In order to further verify that the two re-implementations matched, results were compared over other areas of the parameter space. It was found (after some checking and debugging) that Implementations A and B agreed very closely over a wide variety of parameter settings (we did not find a set of parameter values where they disagreed).
Thus we had three models and three sets of results, so that two sets of results aligned but both disagreed with the published conceptual model and results. The situation is illustrated in Figure 1.

Figure 1. Relationship of published model and two re-implementations
Given that two independent implementations match each other but show significant differences from the original published results we speculated that there were three possible sources of the inconsistency:
· The implementation used to produce the published results did not match the published conceptual model.
· Some aspect of the conceptual model was not clearly stated in the published article
· Both re-implementations had somehow been independently and incorrectly implemented (in the same way).
In some sense these points are not distinct issues but are very much related. It is essentially a matter of opinion as to the sufficiency and clarity of a natural language description of a “conceptual model” for the purposes of re-implementation. However, our argument is that if two independent re-implementations appear to make “the same mistake” then the problem is more likely to lay with the conceptual model description. This would seem to be especially true when the re-implementations align themselves even on previously unseen results from different parts of the parameter space.
In order to proceed we reconsidered the original natural language specification of the conceptual model given be Riolo et al. Where we found ambiguity we tried alternative implementations that conformed to the specification.
The problem of interpretation was identified in the tournament selection procedure for reproduction. In the original paper it is described thus:
After all agents have participated in all parings in a generations agents are reproduced on the basis of their score relative to others. The least fit, median fit, and most fit agents have respectively 0, 1 and 2 as the expected number of their offspring. This is accomplished by comparing each agent with another randomly chosen agent, and giving an offspring to the on with the higher score.
It turned-out that in both re-implementations the authors had assumed that when an agent has an identical score to a randomly chosen agent then a random choice is made between them to decide which to reproduce into the next generation (since this is left unspecified in the text). However, when a comparison was made of alternative re-implementations with different reproduction rubrics (when both agents have the same score) it was found that the original implementation must have incorporated a bias towards the systematically selected agent (not the randomly selected agent).
Let us be clearer now (with pseudo-code) as to the three different possible rubrics of reproduction we implemented. Table 6 gives outline algorithms for each of the three rubrics that were implemented.
|
Three Variants Of Tournament Selection |
||
|
LOOP for each agent in population Select current agent (a) from pop Select random agent (b) from pop IF score (a) > score (b) THEN Reproduce (a) in next generation ELSE IF score (a) < score (b) THEN Reproduce (b) in next generation ELSE (a) and (b) are equal Select randomly (a) or (b) to be reproduced into next generation. END IF END LOOP |
LOOP for each agent in population Select current agent (a) from pop Select random agent (b) from pop IF score (a) >= score (b) THEN Reproduce (a) in next generation ELSE score (a) < score (b) Reproduce (b) in next generation END IF END LOOP |
LOOP for each agent in population Select current agent (a) from pop Select random agent (b) from pop IF score (a) <= score (b) THEN Reproduce (b) in next generation ELSE score (a) > score (b) Reproduce (a) in next generation END IF END LOOP |
|
a) No Bias |
b) Selected Bias |
c) Random Bias |
Table 6. Outline pseudo-code algorithms for the three different tournament selection methods tested.
It was found that the authors of both re-implementations had independently assumed that the “no bias” algorithm (Table 6a) was the correct interpretation of the natural language description given in the original published article (see above). However, it was determined that the “selected bias” algorithm (Table 6b) reproduced the results given. Hence it would appear that Riolo et al, used the “selected bias” method and that this was the reason for the different results obtained from the two previous re-implementations. Tables 7 and 8 shows results for each of the algorithms given in Table 6. It is worth noting that the “selected bias” method almost perfectly matches the published results but that this method produces different results from the other two (“no bias” and “random bias”) methods.
|
|
Results From The Three Variants Of Tournament Selection |
|||||
|
|
No Bias (a) |
Selected Bias (b) |
Random Bias (c) |
|||
|
Parings |
Don |
Ave. Tol |
Don |
Ave Tol |
Don |
Ave Tol |
|
1 |
5.1 |
0.010 |
2.1 |
0.009 |
6.0 |
0.010 |
|
2 |
42.6 |
0.012 |
4.4 |
0.007 |
49.6 |
0.013 |
|
3 |
73.7 |
0.018 |
73.7 |
0.019 |
73.7 |
0.018 |
|
4 |
76.8 |
0.021 |
76.9 |
0.021 |
76.8 |
0.021 |
|
6 |
78.6 |
0.023 |
78.6 |
0.023 |
78.7 |
0.023 |
|
8 |
79.2 |
0.025 |
79.2 |
0.025 |
79.2 |
0.025 |
|
10 |
79.4 |
0.026 |
79.4 |
0.026 |
79.4 |
0.026 |
Table 7. Here we compare the results of donation rates (Don) and average tolerances (Ave. Tol) for each of the three different tournament selection algorithms described in table 6 over different “pairings” values. We note that the “selected bias” method very closely matches the published results from the initial model (given in table 1). As before the results are calculated from 30 independent runs to 30,000 generations.
|
|
Results From The Three Variants Of Tournament Selection |
|||||
|
|
No Bias (a) |
Selected Bias (b) |
Random Bias (c) |
|||
|
Cost |
Ave. Don |
Ave. Tol |
Ave Don |
Ave Tol |
Ave Don |
Ave Tol |
|
0.05 |
73.7 |
0.018 |
73.6 |
0.018 |
73.7 |
0.018 |
|
0.1 |
73.7 |
0.018 |
73.7 |
0.019 |
73.7 |
|