MSS: A manuscript copying simulation

T. J. Finney

Introduction

[1] This program is an attempt to simulate the production of manuscript copies of a particular work. It is based on a model where audiences at various geographical locations create a demand for MSS. Each audience accumulates a collection of MSS that serves as a source of exemplars for copying and correction. MSS may also be imported from other audiences.

[2] This is nothing more than a first approximation to what is, in reality, a complex phenomenon. My general approach has been to simplify the various processes as much as possible, attempting to walk the line between over-simplification and unfruitful elaboration. This approach has a number of manifestations that may seem surprising at first. For example, all copying and correction is carried out in the same manner, as if the audiences were relying on one poor scribe to do all of the work. In reality there would be many scribes with differing abilities. However, it is reasonable to expect scribes in one region to be as proficient as their counterparts elsewhere. The result of using a single, typical scribe can therefore be expected to approximate results produced by an array of scribes given a large number of copying and correction events. As another example, all audiences are treated as exhibiting the same fundamental behaviour.

[3] The program is stochastic, relying on probabilities to describe the processes of correction, discard, copying, and importation. The user specifies these probabilities along with other parameters in a configuration file ("config.xml"). When running, the program performs trials to determine the outcome of each instance of a modelled process.

Zipf's law

[4] A number of processes employ Zipf's law to calculate the probabilities of various occurrences. Zipf's law, named after George K. Zipf, is a scaling law that describes frequency of use versus popularity ranking. Book loans, video rentals, and web page hits tend to conform to this pattern:

frequency = k x rank^-n

where k is a normalisation constant and n is an exponent that is often close to one. When used with an exponent of one, this law states that the most popular item is twice as likely to be chosen as the second most popular item, that the third-ranked item is one third as likely to be chosen as the first, and so on.
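
As a rough illustration of the formula above, the following Java sketch computes normalised Zipf probabilities for a given number of ranks. It is not taken from the program's source files; the class name ZipfSketch and the method name probabilities are hypothetical, and dividing by the sum of the weights is simply one way to realise the constant k.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and not taken from MSS.java.
class ZipfSketch {
    // Returns probabilities for ranks 1..count under frequency = k x rank^-n.
    static double[] probabilities(int count, double exponent) {
        double[] p = new double[count];
        double sum = 0.0;
        for (int rank = 1; rank <= count; rank++) {
            p[rank - 1] = Math.pow(rank, -exponent); // unnormalised weight
            sum += p[rank - 1];
        }
        for (int i = 0; i < count; i++) {
            p[i] /= sum; // dividing by the sum plays the role of k
        }
        return p;
    }

    public static void main(String[] args) {
        // With an exponent of one, rank 1 is twice as likely as rank 2.
        System.out.println(Arrays.toString(probabilities(3, 1.0)));
    }
}
```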

[5] Whenever a MS is added to an audience's collection, it is assigned a rank at random. That is, any new acquisition has an equal chance of ending up anywhere in the ranks from least to most popular. Thereafter, the MS is withdrawn from the collection to serve as an exemplar according to its rank using Zipf's law. Consequently, the texts of some MSS are more likely to be propagated than others.
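
The ranking scheme just described might be sketched in Java as follows. This is an assumption-laden illustration rather than the program's actual code: the names RankedSelectionSketch, pickRank, and randomInsertPosition are hypothetical, and the cumulative-weight loop is just one common way to draw a rank according to Zipf's law.

```java
import java.util.Random;

// Illustrative sketch only; names are hypothetical and not taken from MSS.java.
class RankedSelectionSketch {
    // Picks a rank in 1..size with probability proportional to rank^-exponent.
    static int pickRank(int size, double exponent, Random rng) {
        double total = 0.0;
        for (int rank = 1; rank <= size; rank++) {
            total += Math.pow(rank, -exponent);
        }
        double target = rng.nextDouble() * total;
        double running = 0.0;
        for (int rank = 1; rank <= size; rank++) {
            running += Math.pow(rank, -exponent);
            if (target < running) {
                return rank;
            }
        }
        return size; // guard against floating-point rounding
    }

    // A new acquisition is equally likely to land at any position 0..size.
    static int randomInsertPosition(int size, Random rng) {
        return rng.nextInt(size + 1);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println("new MS inserted at position " + randomInsertPosition(10, rng));
        System.out.println("exemplar drawn from rank " + pickRank(11, 1.0, rng));
    }
}
```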

A short description of the processes

[6] The heart of the program is a cycle that repeats with every year of the simulation. An archetype is introduced at a specified time and place, and is copied according to the demand for MSS generated by each audience. An audience's demand is calculated as follows:

D = (N x r1) - n

where N is the audience population, r1 is the average number of MSS per audience member, and n is the number of MSS in the audience's collection.

[7] The audience population is calculated according to the logistic growth equation,

dN/dt = N x r x (1 - N/K)

where dN/dt is the rate of change of the population N with time, r is the growth rate, and K is the limiting value of the population. This equation produces exponential growth at first, but the growth slows as the population approaches its limit. The user is free to choose whatever values are thought appropriate to model the audiences of interest.
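
The demand and growth equations above can be combined in a small Java sketch. It assumes the growth equation is advanced in one-year steps (a simple Euler step), which is a guess about how a yearly cycle might apply it rather than a description of the program's code. The class and method names are hypothetical, and the parameter values in main are invented for illustration.

```java
// Illustrative sketch only; names and parameter values are hypothetical.
class GrowthSketch {
    // One step of dN/dt = N x r x (1 - N/K), assuming a time step of one year.
    static double nextPopulation(double n, double r, double k) {
        return n + n * r * (1.0 - n / k);
    }

    // D = (N x r1) - n: MSS wanted by the audience minus MSS already held.
    static double demand(double population, double mssPerMember, int collectionSize) {
        return population * mssPerMember - collectionSize;
    }

    public static void main(String[] args) {
        double population = 100.0;
        for (int year = 1; year <= 5; year++) {
            population = nextPopulation(population, 0.1, 1000.0);
            System.out.println("year " + year + ": N = " + population
                    + ", D = " + demand(population, 0.05, 4));
        }
    }
}
```

A negative value of D would mean the collection already exceeds what the audience wants; the sketch leaves that case to the caller.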

[8] Each cycle consists of MS correction, discard, copying, and importation, with the user specifying parameters that determine the frequencies of these processes. Parameters for correction, discard, and importation are directly specified. By contrast, the copying frequency depends on the discard rate, importation rate, and growth equation parameters.

[9] Once the cycles have been completed, a number of MSS are recovered from the ones discarded during the simulation. The number of recovered MSS is determined by the user-specified ratio of recovered to discarded MSS. The respective parts of each year's cycle are discussed below:

[10] Correction: MSS are selected for correction according to a probability specified by the user. Each MS has an equal probability of being selected, independent of its popularity ranking. Once a MS is selected for correction, the program selects an exemplar according to rank, and generates a number between zero and one, with any value in that interval being equally probable. This number determines the probability that a difference between the exemplar and MS will be corrected in the MS. For example, if a value of 0.25 is generated, then the probability of a differing unit being corrected is one in four.
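
A minimal Java sketch of the correction step described above follows. It assumes that a MS can be represented as an array of integer states, one per unit, which is a simplification made for illustration only; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; the int[] representation and names are assumptions.
class CorrectionSketch {
    // Corrects ms in place against the exemplar; returns the number of units changed.
    static int correct(int[] ms, int[] exemplar, Random rng) {
        // One probability is drawn per correction event, uniform between zero and one.
        double p = rng.nextDouble();
        int changed = 0;
        for (int i = 0; i < ms.length; i++) {
            // Only units that differ from the exemplar are candidates for correction.
            if (ms[i] != exemplar[i] && rng.nextDouble() < p) {
                ms[i] = exemplar[i];
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        int[] exemplar = {1, 1, 1, 1, 1, 1};
        int[] ms = {1, 2, 1, 3, 2, 1};
        int changed = correct(ms, exemplar, new Random());
        System.out.println(changed + " unit(s) corrected: " + Arrays.toString(ms));
    }
}
```
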
[11] Discard: MSS are discarded with a frequency determined by the annual discard probability. They are placed in a bin ready for possible recovery at the end of the simulation. MSS that have not been discarded are not recovered at the end.
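
Discard and the eventual recovery from the bin could be sketched in Java as below. The source does not say how the recovered MSS are picked from the bin, so choosing them uniformly at random is an assumption, as is rounding the product of the ratio and the bin size. All names are hypothetical, and MSS are represented by plain strings for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names and the String representation are assumptions.
class DiscardSketch {
    // Moves each MS from the collection to the bin with the annual discard probability.
    static void discardYear(List<String> collection, List<String> bin,
                            double pDiscard, Random rng) {
        Iterator<String> it = collection.iterator();
        while (it.hasNext()) {
            String ms = it.next();
            if (rng.nextDouble() < pDiscard) {
                it.remove();
                bin.add(ms);
            }
        }
    }

    // Recovers (ratio x bin size) MSS, chosen uniformly at random (an assumption).
    static List<String> recover(List<String> bin, double ratio, Random rng) {
        List<String> shuffled = new ArrayList<>(bin);
        Collections.shuffle(shuffled, rng);
        int count = (int) Math.round(ratio * shuffled.size());
        return new ArrayList<>(shuffled.subList(0, count));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        List<String> collection = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            collection.add("MS" + i);
        }
        List<String> bin = new ArrayList<>();
        discardYear(collection, bin, 0.25, rng);
        System.out.println("discarded: " + bin);
        System.out.println("recovered: " + recover(bin, 0.5, rng));
    }
}
```
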
[12] Copying: Copying consists of transferring units from an exemplar to a copy. Here, "unit" means a potential variation unit--one that does not become an actual variation unit until it attains more than one state (i.e. reading). Alternatively, the user may prefer to think of units as representing individual words.

[13] Every time a unit is copied, the program performs a trial to determine whether it is correctly copied. If so, the exemplar's state is transferred to the copy. If not, either a novel state or an existing state of the particular unit is inserted. The program uses the following formula to calculate the probability of choosing a novel reading:

p(novel) = s^-n

where s is the number of existing states and n is a suitable exponent. According to this formula, the probability of creating a new state decreases as the number of states increases, more or less quickly depending on the value of the exponent chosen. (I have settled on a value of 1.5.) If the number of states is one, the resultant probability is one, thereby guaranteeing that a novel reading will be created and that the number of states will increase to two. If a novel state is not selected, then the program chooses an existing state of the unit other than the exemplar's state.
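
To make the behaviour of this formula concrete, here is a short Java sketch that evaluates p(novel) for small numbers of states and performs the corresponding trial. The exponent of 1.5 is the value quoted above; the class and method names are hypothetical and are not taken from the program's source.

```java
import java.util.Random;

// Illustrative sketch only; names are hypothetical and not taken from Scribe.java.
class NovelStateSketch {
    static final double EXPONENT = 1.5; // value quoted in the text

    // p(novel) = s^-n, where s is the number of existing states of the unit.
    static double novelProbability(int existingStates) {
        return Math.pow(existingStates, -EXPONENT);
    }

    // Trial deciding whether a miscopied unit receives a novel state.
    static boolean createsNovelState(int existingStates, Random rng) {
        // nextDouble() is always below 1.0, so a single state always yields a novel one.
        return rng.nextDouble() < novelProbability(existingStates);
    }

    public static void main(String[] args) {
        for (int s = 1; s <= 5; s++) {
            System.out.println("s = " + s + ", p(novel) = " + novelProbability(s));
        }
    }
}
```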

[14] The states are ranked. Whenever a novel state is created, it is inserted at random into the array of states for the relevant unit. If an existing state is to be chosen, then the program makes the selection according to Zipf's law. This makes some states more likely to be propagated than others.
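
The handling of ranked states might look something like the following Java sketch. Several details are assumptions: states are represented as integers, the exemplar's state is skipped and the remaining Zipf weights are renormalised, and the exponent is passed in as a parameter because its value for this purpose is not given here. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; representation, names, and renormalisation are assumptions.
class RankedStatesSketch {
    // Inserts a novel state at a uniformly random position in the ranked list.
    static void insertNovel(List<Integer> ranked, int novelState, Random rng) {
        ranked.add(rng.nextInt(ranked.size() + 1), novelState);
    }

    // Chooses an existing state other than the exemplar's, weighted by rank^-exponent.
    static int chooseExisting(List<Integer> ranked, int exemplarState,
                              double exponent, Random rng) {
        double total = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (ranked.get(i) != exemplarState) {
                total += Math.pow(i + 1, -exponent);
            }
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("no alternative state available");
        }
        double target = rng.nextDouble() * total;
        double running = 0.0;
        int chosen = exemplarState;
        for (int i = 0; i < ranked.size(); i++) {
            int state = ranked.get(i);
            if (state == exemplarState) {
                continue; // the exemplar's own state is never chosen
            }
            chosen = state;
            running += Math.pow(i + 1, -exponent);
            if (target < running) {
                break;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        List<Integer> states = new ArrayList<>(Arrays.asList(0, 1));
        insertNovel(states, 2, rng);
        System.out.println("ranked states: " + states);
        System.out.println("chosen instead of state 0: " + chooseExisting(states, 0, 1.0, rng));
    }
}
```
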
[15] Importation: The user specifies a global ratio of imported to local MSS which determines how often MSS are imported into a collection. The source collection is chosen according to the great circle distance between the respective audiences. A power law is then used to calculate the relative probabilities of import from each place:

p(import) = k x d^-n

where k is a normalisation constant, d is the distance, and n is a suitable exponent value. (The program currently uses a value of 1.0.) The MS to be imported is chosen at random from the selected source collection.
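
The import probabilities can be illustrated with the Java sketch below. The haversine formula and the mean Earth radius of 6371 km are my assumptions, since the method used to compute great circle distances is not stated here. The coordinates in main are approximate and chosen only for illustration, the names are hypothetical, and all distances are assumed to be greater than zero.

```java
import java.util.Arrays;

// Illustrative sketch only; haversine, the radius, and all names are assumptions.
class ImportSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great circle distance in km between two points given in degrees (haversine formula).
    static double greatCircleKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double halfDPhi = (phi2 - phi1) / 2.0;
        double halfDLambda = Math.toRadians(lon2 - lon1) / 2.0;
        double a = Math.sin(halfDPhi) * Math.sin(halfDPhi)
                + Math.cos(phi1) * Math.cos(phi2)
                * Math.sin(halfDLambda) * Math.sin(halfDLambda);
        return 2.0 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // p(import) = k x d^-n, normalised over all candidate source audiences.
    static double[] importProbabilities(double[] distances, double exponent) {
        double[] p = new double[distances.length];
        double sum = 0.0;
        for (int i = 0; i < distances.length; i++) {
            p[i] = Math.pow(distances[i], -exponent);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum; // dividing by the sum plays the role of k
        }
        return p;
    }

    public static void main(String[] args) {
        // Approximate coordinates of a destination and two possible sources.
        double[] distances = {
            greatCircleKm(41.9, 12.5, 31.2, 29.9),
            greatCircleKm(41.9, 12.5, 36.2, 36.2)
        };
        System.out.println(Arrays.toString(importProbabilities(distances, 1.0)));
    }
}
```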

Using the program

[16] The program requires a command line for compilation and execution. (Mac users can get a command line by installing OSX.) It is written in Java and makes use of the Apache XML Project's Xerces parser. Because the program is compiled from source, a Java development kit (which supplies the javac compiler as well as the Java Virtual Machine) must be installed on the user's computer before compiling and running the program. The software and installation instructions can be obtained from Sun Microsystems (http://java.sun.com/products/).

[17] The simulation is installed and run as follows:

create a directory for the program files
download Audience.java, MS.java, MSS.java, Scribe.java, Stats.java, config.dtd, config.xml and xerces.jar into the directory
navigate to the directory containing the program files
compile by typing
javac -classpath xerces.jar *.java
run the program by typing (on Windows, use a semicolon instead of the colon in the classpath)
java -classpath .:xerces.jar MSS
experiment by editing the configuration file "config.xml" then running the program again

[18] The configuration file is an XML document that contains a field for each user-specified parameter. More or fewer audiences may be included, but none of the other parameters may be omitted. The settings provided in the default configuration file normally result in an output of about 50000 lines and a run time of about 30 seconds on my machine. The output records every process and concludes with the recovered MSS and the final number of states for each unit. Each recovered MS includes a genealogy and an array of states. The leftmost MS in the genealogy is the MS itself.

[19] Some parameters such as power law exponents and the number of units in each MS are hidden inside the program's source code (i.e. ".java" files). You are welcome to fiddle with these parameters in your own copy of the program. In order for changes to take effect, altered source code needs to be recompiled by typing "javac <name>.java", where <name> refers to the relevant program component.

Conclusion

[20] Hopefully, this program will provide insights into the MS copying process and facilitate the evaluation of various strategies for mapping textual history when only a sample of MSS is extant. Each run produces a complete account of a copying history and includes information that is usually not available in real corpora (e.g. the number of generations a MS is removed from its archetype). The program may be freely used but not sold.

[21] I would like to acknowledge the help of James R. Adair, who suggested a number of improvements, including what I regard as an elegant method of calculating the probabilities of novel states.

http://purl.org/TC/downloads/simulation/ReadMe.html

Last modified: Fri May 3 18:00:18 EDT 2002