MSS: A manuscript copying simulation

T. J. Finney


[1] This program is an attempt to simulate the production of manuscript copies of a particular work. It is based on a model where audiences at various geographical locations create a demand for MSS. Each audience accumulates a collection of MSS that serves as a source of exemplars for copying and correction. MSS may also be imported from other audiences.

[2] This is nothing more than a first approximation to what is, in reality, a complex phenomenon. My general approach has been to simplify the various processes as much as possible, attempting to walk the line between over-simplification and unfruitful elaboration. This approach has a number of manifestations that may seem surprising at first. For example, all copying and correction is carried out in the same manner, as if the audiences were relying on one poor scribe to do all of the work. In reality there would be many scribes with differing abilities. However, it is reasonable to expect scribes in one region to be as proficient as their counterparts elsewhere. The result of using a single, typical scribe can therefore be expected to approximate results produced by an array of scribes given a large number of copying and correction events. As another example, all audiences are treated as exhibiting the same fundamental behaviour.

[3] The program is stochastic, relying on probabilities to describe the processes of correction, discard, copying, and importation. The user specifies these probabilities along with other parameters in a configuration file ("config.xml"). When running, the program performs trials to determine the outcome of each instance of a modelled process.
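A single trial of this kind can be sketched as follows. This is only an illustration of the idea, not the program's own code: the class name TrialSketch and the method occurs are hypothetical, and the probability would in practice come from the configuration file.

```java
import java.util.Random;

// Illustrative sketch only: names are hypothetical, not taken from the MSS source.
class TrialSketch {
    /** Returns true with the given probability (0.0 = never, 1.0 = always). */
    static boolean occurs(double probability, Random random) {
        // nextDouble() is uniform on [0, 1), so the comparison succeeds
        // with the requested probability.
        return random.nextDouble() < probability;
    }

    public static void main(String[] args) {
        Random random = new Random();
        double discardProbability = 0.02; // assumed value, for illustration only
        int discarded = 0;
        for (int ms = 0; ms < 1000; ms++) {
            if (occurs(discardProbability, random)) {
                discarded++;
            }
        }
        System.out.println("MSS discarded this year: " + discarded);
    }
}
```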

Zipf's law

[4] A number of processes employ Zipf's law to calculate the probabilities of various occurrences. Zipf's law, named after George K. Zipf, is a scaling law that describes frequency of use versus popularity ranking. Book loans, video rentals, and web page hits tend to conform to this pattern:

frequency = k x rank^(-n)
where k is a normalisation constant and n is an exponent that is often close to one. When used with an exponent of one, this law states that the most popular item is twice as likely to be chosen as the second most popular item, the third ranked item is one third as likely to be chosen as the first, and so on.
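As a worked illustration of the formula, the following sketch computes the probability of choosing each rank when the frequencies are normalised to sum to one. The class and method names are hypothetical and are not taken from the program's source.

```java
// Illustrative sketch only: names are hypothetical, not taken from the MSS source.
class ZipfSketch {
    /** Returns selection probabilities for ranks 1..count under frequency = k x rank^(-n). */
    static double[] probabilities(int count, double exponent) {
        double[] p = new double[count];
        double sum = 0.0;
        for (int rank = 1; rank <= count; rank++) {
            p[rank - 1] = Math.pow(rank, -exponent);
            sum += p[rank - 1];
        }
        // Dividing by the sum plays the role of the normalisation constant k.
        for (int i = 0; i < count; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        double[] p = probabilities(3, 1.0);
        for (int i = 0; i < p.length; i++) {
            System.out.printf("rank %d: %.4f%n", i + 1, p[i]);
        }
    }
}
```

With an exponent of one, the first entry is twice the second and three times the third, as described above.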

[5] Whenever a MS is added to an audience's collection, it is assigned a rank at random. That is, any new acquisition has an equal chance of ending up anywhere in the ranks from least to most popular. Thereafter, the MS is withdrawn from the collection to serve as an exemplar according to its rank using Zipf's law. Consequently, the texts of some MSS are more likely to be propagated than others.
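The ranking scheme just described might be sketched as below. This is a minimal illustration under stated assumptions, not the program's implementation: the class CollectionSketch and its methods are hypothetical, MSS are represented by plain strings, and the chosen exemplar is simply returned rather than removed from the collection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: names are hypothetical, not taken from the MSS source.
class CollectionSketch {
    private final List<String> ranked = new ArrayList<>(); // index 0 holds rank 1
    private final Random random;
    private final double exponent;

    CollectionSketch(Random random, double exponent) {
        this.random = random;
        this.exponent = exponent;
    }

    /** Inserts a new MS at a uniformly random rank, shifting lower ranks down. */
    void add(String ms) {
        // nextInt(size + 1) gives every position, including the end, an equal chance.
        ranked.add(random.nextInt(ranked.size() + 1), ms);
    }

    /** Returns the MS currently holding the given rank (1 = most popular). */
    String atRank(int rank) {
        return ranked.get(rank - 1);
    }

    /** Picks an exemplar with probability proportional to rank^(-exponent). */
    String chooseExemplar() {
        if (ranked.isEmpty()) {
            throw new IllegalStateException("collection is empty");
        }
        double total = 0.0;
        for (int rank = 1; rank <= ranked.size(); rank++) {
            total += Math.pow(rank, -exponent);
        }
        double target = random.nextDouble() * total;
        for (int rank = 1; rank <= ranked.size(); rank++) {
            target -= Math.pow(rank, -exponent);
            if (target < 0.0) {
                return ranked.get(rank - 1);
            }
        }
        return ranked.get(ranked.size() - 1); // guards against rounding error
    }

    public static void main(String[] args) {
        CollectionSketch collection = new CollectionSketch(new Random(), 1.0);
        collection.add("MS-1");
        collection.add("MS-2");
        collection.add("MS-3");
        System.out.println("Exemplar chosen: " + collection.chooseExemplar());
    }
}
```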

A short description of the processes

[6] The heart of the program is a cycle that repeats with every year of the simulation. An archetype is introduced at a specified time and place, and is copied according to the demand for MSS generated by each audience. An audience's demand is calculated as follows:

D = (N x r1) - n
where N is the audience population, r1 is the average number of MSS per audience member, and n is the number of MSS in the audience's collection.
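The demand formula can be written as a one-line calculation. The sketch below is illustrative only: the names are hypothetical, the figures in main are invented for the example, and how the program itself treats fractional or negative demand is not stated here, so the raw value is returned.

```java
// Illustrative sketch only: names are hypothetical, not taken from the MSS source.
class DemandSketch {
    /** D = (N x r1) - n */
    static double demand(double population, double mssPerMember, int collectionSize) {
        return population * mssPerMember - collectionSize;
    }

    public static void main(String[] args) {
        // Invented figures: 1000 members, 0.05 MSS per member, 30 MSS already held.
        double d = demand(1000.0, 0.05, 30);
        System.out.println("Demand for new MSS: " + d);
    }
}
```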

[7] The audience population is calculated according to the logistic growth equation,

dN/dt = N x r x (1 - N/K)
where dN/dt is the rate of change of the population N with time, r is the growth rate, and K is the limiting value of the population. This equation produces exponential growth at first, but the growth slows as the population approaches its limit. The user is free to choose whatever values are thought appropriate to model the audiences of interest.
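For illustration, the logistic equation has the well-known closed-form solution N(t) = K / (1 + ((K - N0) / N0) x e^(-r x t)), where N0 is the population at t = 0. The sketch below evaluates it year by year. Whether the program uses this closed form or a stepwise update is not stated here, so this is an assumption, as are the class name, method name, and the example values.

```java
// Illustrative sketch only: names and values are hypothetical, not taken from the MSS source.
class LogisticSketch {
    /** Closed-form solution of dN/dt = N x r x (1 - N/K) with N(0) = n0. */
    static double population(double n0, double r, double k, double t) {
        return k / (1.0 + ((k - n0) / n0) * Math.exp(-r * t));
    }

    public static void main(String[] args) {
        double n0 = 10.0;  // assumed starting population
        double r = 0.05;   // assumed growth rate per year
        double k = 1000.0; // assumed limiting population
        for (int year = 0; year <= 200; year += 50) {
            System.out.printf("year %d: %.1f%n", year, population(n0, r, k, year));
        }
    }
}
```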

[8] Each cycle consists of MS correction, discard, copying, and importation, with the user specifying parameters that determine the frequencies of these processes. Parameters for correction, discard, and importation are directly specified. By contrast, the copying frequency depends on the discard rate, importation rate, and growth equation parameters.

[9] Once the cycles have been completed, a number of MSS are recovered from the ones discarded during the simulation. The number of recovered MSS is determined by the user-specified ratio of recovered to discarded MSS. Respective parts of each year's cycle are discussed below:

Using the program

[16] The program requires a command line for compilation and execution. (Mac users can get a command line by installing OS X.) It is written in Java and makes use of the Apache XML Project's Xerces parser. The Java Virtual Machine must be installed on the user's computer before running the program. The virtual machine and installation instructions can be obtained from Sun Microsystems.

[17] The simulation is installed and run as follows:

  1. create a directory for the program files
  2. download the program's Java source files, config.dtd, config.xml and xerces.jar into the directory
  3. navigate to the directory containing the program files
  4. compile by typing
    javac -classpath xerces.jar *.java
  5. run the program by typing
    java -classpath .:xerces.jar MSS
  6. experiment by editing the configuration file "config.xml" then running the program again

[18] The configuration file is an XML document that contains a field for each user-specified parameter. More or fewer audiences may be included, but none of the other parameters may be omitted. The settings provided in the default configuration file normally result in an output of about 50000 lines and a run time of about 30 seconds on my machine. The output records every process and concludes with the recovered MSS and the final number of states for each unit. Each recovered MS includes a genealogy and array of states. The leftmost MS in the genealogy is the MS itself.

[19] Some parameters such as power law exponents and the number of units in each MS are hidden inside the program's source code (i.e. ".java" files). You are welcome to fiddle with these parameters in your own copy of the program. In order for changes to take effect, altered source code needs to be recompiled by typing "javac <name>.java", where <name> refers to the relevant program component.


[20] Hopefully, this program will provide insights into the MS copying process and facilitate the evaluation of various strategies for mapping textual history when only a sample of MSS is extant. Each run produces a complete account of a copying history and includes information that is usually not available in real corpora (e.g. the number of generations a MS is removed from its archetype). The program may be freely used but not sold.

[21] I would like to acknowledge the help of James R. Adair who suggested a number of improvements, including what I regard to be an elegant method of calculating the probabilities of novel states.
Last modified: Fri May 3 18:00:18 EDT 2002