Example
Due to the availability of extensive collation data for the Greek New Testament, and because this project was originally developed for use with such data, we tested this library on a sample collation of the book of Ephesians in thirty-eight textual witnesses (including manuscripts, correctors’ hands, translations to other languages, and quotations from church fathers). The manuscript transcriptions used for this collation were those produced by the University of Birmingham’s Institute for Textual Scholarship and Electronic Editing (ITSEE) for the International Greek New Testament Project (IGNTP); they are freely accessible at https://itseeweb.cal.bham.ac.uk/epistulae/XML/igntp.xml. To achieve a balance between variety and conciseness, we restricted the collation to a set of forty-two variation units in Ephesians corresponding to variation units in the United Bible Societies Greek New Testament, which highlights variation units that affect substantive matters of translation.
In our example collation, witnesses are described in the listWit
element under the teiHeader
. Because most New Testament witnesses
are identified by numerical Gregory-Aland identifiers, these witnesses
are identified with @n
attributes; the recommended practice is to
identify such elements by @xml:id
attributes, but this software is
designed to work with either identifying attribute (preferring
@xml:id
if both are provided), and we have left things as they are
to demonstrate this feature.
The witness
elements in the example collation also contain origDate
elements that provide dates or date ranges for the corresponding witnesses.
Where a witness can be dated to a specific year, the @when
attribute is sufficient to specify this; if it can be dated within a range of years,
the @notBefore
and @notAfter
attributes should be used.
While such dating elements are not required, our software includes them in the conversion process whenever possible.
This way, phylogenetic methods that employ clock models and other chronolological constraints can benefit from this information when it is provided.
Each variation unit is encoded as an app
element with a unique
@xml:id
attribute. Within a variation unit, a lem
element
without a @wit
attribute presents the main text, and it is followed
by rdg
elements that describe variant readings (with the first
rdg
duplicating the lem
reading and detailing its witnesses) and
their attestations among the witnesses. (Situations where the lem
reading is not duplicated by the first rdg
element, but has its own
@wit
attribute, are also supported.) For conciseness, we use the
@n
attribute for each reading as a local identifier; the recommended
practice for readings that will be referenced elsewhere is to use the
@xml:id
attribute, and this software will use this as the identifier
if it is specified, but we have only specified @xml:id
attributes
for rdg
elements referenced in other variation units to demonstrate
the flexibility of the software. For witnesses with missing or ambiguous
readings at a given variation unit, we use the witDetail
element.
For ambiguous readings, we specify their possible disambiguations with
the @target
attribute and express our degrees of certainty about
these disambiguations using certainty
elements under the
witDetail
element.
The TEI XML file
for this example is available in the example
directory in the GitHub repository.
IQ-TREE
IQ-TREE is a popular phylogenetic analysis package.
To use it to perform a maximum likelihood phylogenetic analysis of the Ephesians example,
convert the TEI XML to NEXUS format using teiphy
with the command
teiphy -t reconstructed -t defective -t orthographic -m overlap -m lac -s"*" -s T --fill-correctors example/ubs_ephesians.xml ubs_ephesians_iqtree.nexus
This file can then be run in IQ-TREE with the following command:
iqtree -s ubs_ephesians_iqtree.nexus -m MK -bb 1000
This uses the Lewis Mk substitution model (without ascertainment bias correction) and with 1000 bootstrap replicates. If you wish to use ascertainment bias correction, then you must first exclude constant sites from the output as follows:
teiphy -t reconstructed -t defective -t orthographic -m overlap -m lac -s"*" -s T --drop-constant --fill-correctors example/ubs_ephesians.xml ubs_ephesians_iqtree.nexus
This file can then be run in IQ-TREE with ascertainment bias correction using the following command:
iqtree -s ubs_ephesians_iqtree.nexus -m MK+ASC -bb 1000
An example of a tree produced by IQ-TREE is found below:
Running this example with IQ-TREE is part of the continuous integration pipeline:
MrBayes
MrBayes is a Bayesian phylogenetic software package.
To use it to perform a phylogenetic analysis of the Ephesians example,
convert the TEI XML to NEXUS format using teiphy
with this command:
teiphy -t reconstructed -t defective -t orthographic -m overlap -m lac -s"*" -s T --fill-correctors --no-labels example/ubs_ephesians.xml ubs_ephesians_mrbayes.nexus
Note
MrBayes requires the --no-labels
flag.
This will generate the input file for MrBayes with the default strict clock model. If you would prefer to use an uncorrelated (independent gamma rate) clock model, then you can do so with the following command:
teiphy -t reconstructed -t defective -t orthographic -m overlap -m lac -s"*" -s T --fill-correctors --no-labels --clock uncorrelated example/ubs_ephesians.xml ubs_ephesians_mrbayes.nexus
This file can then be read into MrBayes as follows:
mb ubs_ephesians_mrbayes.nexus
More settings can be added manually to the NEXUS file to control the Bayesian analysis as described in the MrBayes manual.
An example of a maximum clade credibility tree produced by MrBayes for the generated input with a strict clock is found below. The label on each internal node is the probability of that clade being present in the posterior:
Running this example with MrBayes is part of the continuous integration pipeline:
BEAST 2.7
BEAST 2 is another Bayesian phylogenetic software package that boasts various model options and extensive customizability.
To use it to perform a phylogenetic analysis of the Ephesians example,
convert the TEI XML to BEAST 2.7 XML format using teiphy
with this command:
teiphy -t reconstructed -t defective -t orthographic -t subreading -m overlap -m lac -s"*" -s T --fill-correctors example/ubs_ephesians.xml ubs_ephesians_beast.xml
This will generate the input file for BEAST with the default strict clock model. If you would prefer to use an uncorrelated random clock model or a local random clock model, then you can do so with either of the following commands, respectively:
teiphy -t reconstructed -t defective -t orthographic -t subreading -m overlap -m lac -s"*" -s T --fill-correctors --clock uncorrelated example/ubs_ephesians.xml ubs_ephesians_beast.xml
teiphy -t reconstructed -t defective -t orthographic -t subreading -m overlap -m lac -s"*" -s T --fill-correctors --clock local example/ubs_ephesians.xml ubs_ephesians_beast.xml
To run this input, you must first make sure that you have BEAST 2.7 or later installed (as earlier versions will not be compatible with the XML format)
and that the BEASTLabs and BDSKY packages are installed (which you can do using the packagemanager
utility that comes with BEAST 2).
Once these packages are installed, you can run the example with the command
beast ubs_ephesians_beast.xml
An example of a maximum clade credibility tree produced by BEAST 2 for the generated input with a strict clock is found below. The label on each internal node is the probability of that clade being present in the posterior:
Running this example with BEAST 2 is part of the continuous integration pipeline:
STEMMA
STEMMA is a phylogenetic analysis program written by Stephen C. Carlson. It searches for an optimal stemma topology according to the maximum-parsimony criterion and uses reticulating links to model contamination between branches to form a phylogenetic network.
To create the files required for STEMMA, run this command:
teiphy -t reconstructed -t defective -t orthographic -m overlap -m lac -s"*" -s T --fill-correctors --format stemma example/ubs_ephesians.xml stemma_example
This will create two files: stemma_example
(containing the textual information from the collation) and stemma_example_chron
(containing date ranges for witnesses).
These can then be used with Carlson’s prep program to prepare the file for phylogenetic analysis:
prep stemma_example
Finally, the analysis is run with these commands:
stemma stemma_example a 100
soln stemma_example SOLN
This begins a heuristic search for the optimal stemma using a simulated annealing approach (option a
) for 100 iterations.
An example of a tree produced by STEMMA is found below:
Note that some witnesses (e.g., 012, 35) from the collation are excluded from this tree by STEMMA because they have the same reading sequence as another witness after their reconstructed, defective, and orthographic readings have been regularized.
Running this example with STEMMA is part of the continuous integration pipeline: