API Reference

Collation

class teiphy.Collation(xml: ElementTree, manuscript_suffixes: List[str] = [], trivial_reading_types: List[str] = [], missing_reading_types: List[str] = [], fill_corrector_lacunae: bool = False, verbose: bool = False)

Base class for storing TEI XML collation data internally.

This corresponds to the entire XML tree, rooted at the TEI element of the collation.

manuscript_suffixes

A list of suffixes used to distinguish manuscript subwitnesses like first hands, correctors, main texts, alternate texts, and multiple attestations from their base witnesses.

trivial_reading_types

A set of reading types (e.g., “reconstructed”, “defective”, “orthographic”, “subreading”) whose readings should be collapsed under the previous substantive reading.

missing_reading_types

A set of reading types (e.g., “lac”, “overlap”) whose readings should be treated as missing data.

fill_corrector_lacunae

A boolean flag indicating whether or not to fill “lacunae” in witnesses with type “corrector”.

witnesses

A list of Witness instances contained in this Collation.

witness_index_by_id

A dictionary mapping base witness ID strings to their int indices in the witnesses list.

variation_units

A list of VariationUnit instances contained in this Collation.

readings_by_witness

A dictionary mapping base witness ID strings to lists of reading support coefficients for all units (with at least two substantive readings).

substantive_variation_unit_ids

A list of ID strings for variation units with two or more substantive readings.

substantive_variation_unit_reading_tuples

A list of (variation unit ID, reading ID) tuples for substantive readings.

verbose

A boolean flag indicating whether or not to print timing and debugging details for the user.

get_base_wit(wit: str)

Given a witness siglum, strips off the specified manuscript suffixes until the siglum matches one in the witness list or until no more suffixes can be stripped.

Parameters

wit – A string representing a witness siglum, potentially including suffixes to be stripped.
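The stripping loop described above can be sketched in plain Python. This is an illustration only, not teiphy's implementation: the function name `strip_suffixes`, the explicit `known_ids` argument, and the rule of trying suffixes in list order are all assumptions made for the example.

```python
from typing import Iterable, List


def strip_suffixes(wit: str, known_ids: Iterable[str], suffixes: List[str]) -> str:
    """Strip trailing suffixes from a siglum until it matches a known witness ID.

    Hypothetical helper: suffixes are tried in list order (an assumption).
    """
    known = set(known_ids)
    base = wit
    while base not in known:
        # Find the first listed suffix that ends the current siglum, if any.
        match = next((s for s in suffixes if s and base.endswith(s)), None)
        if match is None:
            break  # no more suffixes can be stripped
        base = base[: -len(match)]
    return base


if __name__ == "__main__":
    suffixes = ["*", "T", "/1"]
    print(strip_suffixes("01*", ["01", "02"], suffixes))    # 01
    print(strip_suffixes("02T/1", ["01", "02"], suffixes))  # 02
```

A siglum that never matches the witness list is returned with whatever suffixes could be removed, which mirrors the "until no more suffixes can be stripped" clause.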

get_beast_code_map_for_unit(symbols, missing_symbol, vu_ind)

Returns a string containing state/reading code mappings in BEAST format using the given single-state and missing state symbols for the character/variation unit at the given index. If the variation unit at the given index is a singleton unit (i.e., if it has only one substantive reading), then a code for a dummy state will be included.

Parameters

vu_ind – An integer index for the desired unit.

Returns

A string containing comma-separated code mappings.

get_beast_date_map(taxlabels)

Returns a string representing witness-to-date mappings in BEAST format.

Since this format requires single dates as opposed to date ranges, witnesses with closed date ranges will be mapped to the average of their lower and upper bounds, and witnesses with open date ranges will not be mapped.

Parameters

taxlabels – A list of slugified taxon labels.

Returns

A string containing comma-separated date calibrations of the form witness_id=date.
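As a rough illustration of the averaging rule, the following sketch builds a calibration string from a dictionary of date ranges. The function name `date_map`, the `date_ranges` argument, and the number formatting are assumptions for the example; the real method reads dates from the Collation's witnesses.

```python
from typing import Dict, List, Optional, Tuple

DateRange = Tuple[Optional[int], Optional[int]]


def date_map(taxlabels: List[str], date_ranges: Dict[str, DateRange]) -> str:
    """Build comma-separated witness_id=date pairs (illustrative only)."""
    calibrations = []
    for label in taxlabels:
        lower, upper = date_ranges.get(label, (None, None))
        if lower is None or upper is None:
            continue  # open date range: the witness is left unmapped
        midpoint = (lower + upper) / 2  # closed range: average of the bounds
        calibrations.append(f"{label}={midpoint:g}")  # formatting is an assumption
    return ",".join(calibrations)


if __name__ == "__main__":
    ranges = {"P46": (175, 225), "01": (300, 400), "Origen": (None, 254)}
    print(date_map(["P46", "01", "Origen"], ranges))  # P46=200,01=350
```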

get_beast_equilibrium_frequencies_for_unit(vu_ind)

Returns a string containing state/reading equilibrium frequencies in BEAST format for the character/variation unit at the given index. Since the equilibrium frequencies are not used with the substitution models, the equilibrium frequencies simply correspond to a uniform distribution over the states. If the variation unit at the given index is a singleton unit (i.e., if it has only one substantive reading), then an equilibrium frequency of 0 will be added for a dummy state.

Parameters

vu_ind – An integer index for the desired unit.

Returns

A string containing space-separated equilibrium frequencies.
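The uniform distribution and the dummy state for singleton units can be sketched as follows. The helper name `uniform_frequencies` and the exact float formatting are assumptions; only the rule itself comes from the description above.

```python
def uniform_frequencies(n_states: int) -> str:
    """Return space-separated uniform frequencies over n_states states.

    A singleton unit (one state) gets an extra dummy state with frequency 0.
    """
    if n_states < 1:
        raise ValueError("a variation unit needs at least one substantive reading")
    freqs = [1.0 / n_states] * n_states
    if n_states == 1:
        freqs.append(0.0)  # dummy state for a singleton unit
    return " ".join(str(f) for f in freqs)


if __name__ == "__main__":
    print(uniform_frequencies(4))  # 0.25 0.25 0.25 0.25
    print(uniform_frequencies(1))  # 1.0 0.0
```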

get_beast_origin_span(tip_date_range)

Returns a tuple containing the lower and upper bounds for the height of the origin of the Birth-Death Skyline model. The upper bound on the height of the tree is the difference between the latest tip date and the lower bound on the date of the original work, if both are defined; otherwise, it is left undefined. The lower bound on the height of the tree is the difference between the latest tip date and the upper bound on the date of the original work, if both are defined; otherwise, it is the difference between the earliest tip date and the latest, if both are defined.

Parameters

tip_date_range – A tuple containing the earliest and latest possible tip dates.

Returns

A tuple containing lower and upper bounds on the origin height for the Birth-Death Skyline model.
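The two rules for the bounds can be written out as a small function. This is a sketch under stated assumptions: the origin date range is passed in explicitly as `origin_date_range` (the real method takes only `tip_date_range`), and `None` stands for an undefined bound.

```python
from typing import Optional, Tuple

Bound = Optional[float]


def origin_span(
    tip_date_range: Tuple[Bound, Bound],
    origin_date_range: Tuple[Bound, Bound],
) -> Tuple[Bound, Bound]:
    """Return (lower, upper) bounds on the origin height (illustrative only)."""
    earliest_tip, latest_tip = tip_date_range
    origin_lower, origin_upper = origin_date_range
    # Upper bound: latest tip date minus the lower bound on the work's date.
    upper = None
    if latest_tip is not None and origin_lower is not None:
        upper = latest_tip - origin_lower
    # Lower bound: latest tip date minus the upper bound on the work's date,
    # falling back to the spread of the tip dates.
    lower = None
    if latest_tip is not None and origin_upper is not None:
        lower = latest_tip - origin_upper
    elif earliest_tip is not None and latest_tip is not None:
        lower = latest_tip - earliest_tip
    return (lower, upper)


if __name__ == "__main__":
    print(origin_span((200, 1500), (50, 150)))      # (1350, 1450)
    print(origin_span((200, 1500), (None, None)))   # (1300, None)
```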

get_beast_root_frequencies_for_unit(vu_ind)

Returns a string containing state/reading root frequencies in BEAST format for the character/variation unit at the given index. The root frequencies are calculated from the intrinsic odds at this unit. If the variation unit at the given index is a singleton unit (i.e., if it has only one substantive reading), then a root frequency of 0 will be added for a dummy state. If no intrinsic odds are specified, then a uniform distribution over all states is assumed.

Parameters

vu_ind – An integer index for the desired unit.

Returns

A string containing space-separated root frequencies.

get_beast_symbols()

Returns a list of one-character symbols needed to represent the states of all substantive readings in BEAST format.

The number of symbols equals the maximum number of substantive readings at any variation unit.

Returns

A list of individual characters representing states in readings.

get_fasta_symbols()

Returns a list of one-character symbols needed to represent the states of all substantive readings in FASTA format.

The number of symbols equals the maximum number of substantive readings at any variation unit.

Returns

A list of individual characters representing states in readings.

get_hennig86_symbols()

Returns a list of one-character symbols needed to represent the states of all substantive readings in Hennig86 format.

The number of symbols equals the maximum number of substantive readings at any variation unit.

Returns

A list of individual characters representing states in readings.

get_nexus_symbols()

Returns a list of one-character symbols needed to represent the states of all substantive readings in NEXUS.

The number of symbols equals the maximum number of substantive readings at any variation unit.

Returns

A list of individual characters representing states in readings.

get_phylip_symbols()

Returns a list of one-character symbols needed to represent the states of all substantive readings in PHYLIP format.

The number of symbols equals the maximum number of substantive readings at any variation unit.

Returns

A list of individual characters representing states in readings.

get_readings_by_witness_for_unit(vu: VariationUnit)

Returns a dictionary mapping witness IDs to a list of their reading coefficients for a given variation unit.

Parameters

vu – A VariationUnit to be processed.

Returns

A dictionary mapping witness ID strings to a list of their coefficients for all substantive readings in this VariationUnit.

get_tip_date_range()

Gets the minimum and maximum dates attested among the witnesses. Also checks if the witness with the latest possible date has a fixed date (i.e., if the lower and upper bounds for its date are the same) and issues a warning if not, as this will cause unusual behavior in BEAST 2.

Returns

A tuple containing the earliest and latest possible tip dates.
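A minimal sketch of the range computation and the fixed-date check, assuming date ranges are supplied as a dictionary of `(lower, upper)` tuples with `None` for undefined bounds. The function name, the argument, and the exact condition that triggers the warning are assumptions for illustration.

```python
import warnings
from typing import Dict, Optional, Tuple

DateRange = Tuple[Optional[int], Optional[int]]


def tip_date_range(date_ranges: Dict[str, DateRange]) -> Tuple[Optional[int], Optional[int]]:
    """Return the earliest and latest dates attested among the witnesses."""
    bounds = [d for rng in date_ranges.values() for d in rng if d is not None]
    if not bounds:
        return (None, None)
    earliest, latest = min(bounds), max(bounds)
    # Assumption: warn unless some witness is fixed exactly at the latest date.
    fixed = any(lo == hi == latest for lo, hi in date_ranges.values())
    if not fixed:
        warnings.warn("the latest witness does not have a fixed date")
    return (earliest, latest)


if __name__ == "__main__":
    print(tip_date_range({"A": (200, 300), "B": (1000, 1000)}))  # (200, 1000)
```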

parse_apps(xml: ElementTree)

Given an XML tree for a collation, populates its list of variation units from its app elements.

Parameters

xml – An lxml.etree.ElementTree representing an XML tree rooted at a TEI element.

parse_intrinsic_odds(xml: ElementTree)

Given an XML tree for a collation, populates this Collation’s list of intrinsic probability categories (e.g., “absolutely more likely,” “highly more likely,” “more likely,” “slightly more likely,” “equally likely”) and its dictionary mapping these categories to numerical odds. If a category does not contain a certainty element specifying its number, then it will be assumed to be a parameter to be estimated.

Parameters

xml – An lxml.etree.ElementTree representing an XML tree rooted at a TEI element.

parse_list_wit(xml: ElementTree)

Given an XML tree for a collation, populates its list of witnesses from its listWit element. If the XML tree does not contain a listWit element, then a ParsingException is thrown listing all distinct witness sigla encountered in the collation.

Parameters

xml – An lxml.etree.ElementTree representing an XML tree rooted at a TEI element.

parse_origin_date_range(xml: ElementTree)

Given an XML tree for a collation, populates this Collation’s list of origin date bounds.

Parameters

xml – An lxml.etree.ElementTree representing an XML tree rooted at a TEI element.

parse_readings_by_witness()

Populates the internal dictionary mapping witness IDs to a list of their reading support sets for all variation units, and then fills the empty reading support sets for witnesses of type “corrector” with the entries of the previous witness.
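The filling step for correctors can be pictured with the sketch below. It assumes each witness has one list of coefficients per variation unit, that an "empty" support is a list of all zeros, and that witnesses are processed in list order so that a second corrector inherits from the already-filled first corrector. The function and argument names are hypothetical.

```python
from typing import Dict, List

Support = List[float]


def fill_corrector_lacunae(
    witness_ids: List[str],
    witness_types: Dict[str, str],
    readings_by_witness: Dict[str, List[Support]],
) -> Dict[str, List[Support]]:
    """Fill empty supports of correctors from the previous witness (sketch)."""
    filled = {wit_id: list(units) for wit_id, units in readings_by_witness.items()}
    for i, wit_id in enumerate(witness_ids):
        if i == 0 or witness_types.get(wit_id) != "corrector":
            continue
        previous = filled[witness_ids[i - 1]]
        # Keep the corrector's own support where it has one; otherwise inherit.
        filled[wit_id] = [
            current if any(current) else inherited
            for current, inherited in zip(filled[wit_id], previous)
        ]
    return filled


if __name__ == "__main__":
    ids = ["01", "01C1", "01C2"]
    types = {"01": "manuscript", "01C1": "corrector", "01C2": "corrector"}
    readings = {
        "01": [[1, 0], [0, 1], [1, 0]],
        "01C1": [[0, 0], [1, 0], [0, 0]],
        "01C2": [[0, 0], [0, 0], [0, 1]],
    }
    print(fill_corrector_lacunae(ids, types, readings)["01C2"])
```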

parse_transcriptional_rates(xml: ElementTree)

Given an XML tree for a collation, populates this Collation’s dictionary mapping transcriptional change categories (e.g., “aural confusion,” “visual error,” “clarification”) to numerical rates. If a category does not contain a certainty element specifying its number, then it will be assumed to be a parameter to be estimated.

Parameters

xml – An lxml.etree.ElementTree representing an XML tree rooted at a TEI element.

to_beast(file_addr: Union[Path, str], drop_constant: bool = False, clock_model: ClockModel = ClockModel.strict, ancestral_logger: AncestralLogger = AncestralLogger.state, seed: Optional[int] = None)

Writes this Collation to a file in BEAST format with the given address.

Parameters
  • file_addr – A string representing the path to an output file.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • clock_model – A ClockModel option indicating which clock model to use.

  • ancestral_logger – An AncestralLogger option indicating which class of logger (if any) to use for ancestral states.

  • seed – A seed for random number generation (for setting initial values of unspecified transcriptional rates).

to_csv(file_addr: Union[Path, str], drop_constant: bool = False, ambiguous_as_missing: bool = False, proportion: bool = False, table_type: TableType = TableType.matrix, split_missing: bool = True, **kwargs)

Writes this Collation to a comma-separated value (CSV) file with the given address.

If your witness IDs are numeric (e.g., Gregory-Aland numbers), then they will be written in full to the CSV file, but Excel will likely interpret them as numbers and truncate any leading zeroes!

Parameters
  • file_addr – A string representing the path to an output CSV file; the file type should be .csv.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • ambiguous_as_missing – An optional flag indicating whether to treat all ambiguous states as missing data.

  • proportion (bool, optional) – An optional flag indicating whether or not to calculate distances as proportions over extant, unambiguous variation units.

  • table_type (TableType, optional) – A TableType option indicating which type of tabular output to generate. Only applicable for tabular outputs. Default value is “matrix”.

  • split_missing – An optional flag indicating whether or not to treat missing characters/variation units as having a contribution of 1 split over all states/readings; if False, then missing data is ignored (i.e., all states are 0). Default value is True.

  • **kwargs – Keyword arguments for pandas.DataFrame.to_csv.

to_dataframe(drop_constant: bool = False, ambiguous_as_missing: bool = False, proportion: bool = False, table_type: TableType = TableType.matrix, split_missing: bool = True)

Returns this Collation in the form of a Pandas DataFrame, including the appropriate row and column labels.

Parameters
  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • ambiguous_as_missing (bool, optional) – An optional flag indicating whether to treat all ambiguous states as missing data.

  • proportion (bool, optional) – An optional flag indicating whether or not to calculate distances as proportions over extant, unambiguous variation units.

  • table_type (TableType, optional) – A TableType option indicating which type of tabular output to generate. Only applicable for tabular outputs. Default value is “matrix”.

  • split_missing – An optional flag indicating whether or not to treat missing characters/variation units as having a contribution of 1 split over all states/readings; if False, then missing data is ignored (i.e., all states are 0). Default value is True.

Returns

A Pandas DataFrame corresponding to a collation matrix with reading frequencies or a long table with discrete reading states.

to_distance_matrix(drop_constant: bool = False, proportion=False)

Transforms this Collation into a NumPy distance matrix between witnesses, along with an array of its labels for the witnesses. Distances can be computed either as counts of disagreements (the default setting), or as proportions of disagreements over all variation units where both witnesses have singleton readings.

Parameters
  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • proportion (bool, optional) – An optional flag indicating whether or not to calculate distances as proportions over extant, unambiguous variation units.

Returns

A NumPy distance matrix with a row and column for each witness. A list of witness ID strings.
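The difference between counts and proportions can be seen in a small pure-Python sketch. It assumes each witness is reduced to one reading index per variation unit, with `None` standing for a missing or ambiguous unit, and it returns nested lists rather than a NumPy array. The function name and input layout are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def distance_matrix(
    states_by_witness: Dict[str, List[Optional[int]]],
    proportion: bool = False,
) -> Tuple[List[List[float]], List[str]]:
    """Pairwise disagreement counts or proportions between witnesses (sketch)."""
    labels = list(states_by_witness)
    n = len(labels)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = states_by_witness[labels[i]], states_by_witness[labels[j]]
            # Only units where both witnesses have a single reading are compared.
            comparable = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
            disagreements = sum(1 for x, y in comparable if x != y)
            if proportion and comparable:
                value = disagreements / len(comparable)
            else:
                value = float(disagreements)
            matrix[i][j] = matrix[j][i] = value
    return matrix, labels


if __name__ == "__main__":
    states = {"A": [0, 1, 0, None], "B": [0, 0, 1, 1], "C": [None, 1, 0, 1]}
    counts, labels = distance_matrix(states)
    print(labels, counts)
```

With `proportion=True`, witnesses A and B in this example disagree at two of the three units where both are extant.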

to_excel(file_addr: Union[Path, str], drop_constant: bool = False, ambiguous_as_missing: bool = False, proportion: bool = False, table_type: TableType = TableType.matrix, split_missing: bool = True)

Writes this Collation to an Excel (.xlsx) file with the given address.

Since Pandas is deprecating its support for xlwt, writing output in the old Excel (.xls) format is not recommended.

Parameters
  • file_addr – A string representing the path to an output Excel file; the file type should be .xlsx.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • ambiguous_as_missing – An optional flag indicating whether to treat all ambiguous states as missing data.

  • proportion (bool, optional) – An optional flag indicating whether or not to calculate distances as proportions over extant, unambiguous variation units.

  • table_type (TableType, optional) – A TableType option indicating which type of tabular output to generate. Only applicable for tabular outputs. Default value is “matrix”.

  • split_missing (bool, optional) – An optional flag indicating whether or not to treat missing characters/variation units as having a contribution of 1 split over all states/readings; if False, then missing data is ignored (i.e., all states are 0). Default value is True.

to_fasta(file_addr: Union[Path, str], drop_constant: bool = False)

Writes this Collation to a file in FASTA format with the given address. Note that because FASTA format does not support NEXUS-style ambiguities, such ambiguities will be treated as missing data.

Parameters
  • file_addr – A string representing the path to an output file.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

to_file(file_addr: Union[Path, str], format: Optional[Format] = None, drop_constant: bool = False, split_missing: bool = True, char_state_labels: bool = True, frequency: bool = False, ambiguous_as_missing: bool = False, proportion: bool = False, calibrate_dates: bool = False, mrbayes: bool = False, clock_model: ClockModel = ClockModel.strict, ancestral_logger: AncestralLogger = AncestralLogger.state, table_type: TableType = TableType.matrix, seed: Optional[int] = None)

Writes this Collation to the file with the given address.

Parameters
  • file_addr (Union[Path, str]) – The path to the output file.

  • format (Format, optional) – The desired output format. If None, then it is inferred from the file suffix. Defaults to None.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • split_missing (bool, optional) – An optional flag indicating whether to treat missing characters/variation units as having a contribution of 1 split over all states/readings; if False, then missing data is ignored (i.e., all states are 0). Not applicable for NEXUS, HENNIG86, PHYLIP, FASTA, or STEMMA format. Default value is True.

  • char_state_labels (bool, optional) – An optional flag indicating whether to print the CharStateLabels block in NEXUS output. Default value is True.

  • frequency (bool, optional) – An optional flag indicating whether to use the StatesFormat=Frequency setting instead of the StatesFormat=StatesPresent setting (and thus represent all states with frequency vectors rather than symbols) in NEXUS output. Note that this setting is necessary to make use of certainty degrees assigned to multiple ambiguous states in the collation. Default value is False.

  • ambiguous_as_missing (bool, optional) – An optional flag indicating whether to treat all ambiguous states as missing data. If this flag is set, then only base symbols will be generated for the NEXUS file. It is only applied if the frequency option is False. Default value is False.

  • proportion (bool, optional) – An optional flag indicating whether to populate a distance matrix’s cells with a proportion of disagreements to variation units where both witnesses are extant. It is only applied if the table_type option is “distance”. Default value is False.

  • calibrate_dates – An optional flag indicating whether to add an Assumptions block that specifies date distributions for witnesses in NEXUS output. This option is intended for inputs to BEAST 2.

  • mrbayes – An optional flag indicating whether to add a MrBayes block that specifies model settings and age calibrations for witnesses in NEXUS output. This option is intended for inputs to MrBayes.

  • clock_model – A ClockModel option indicating which type of clock model to use. This option is intended for inputs to MrBayes and BEAST 2. MrBayes does not presently support a local clock model, so it will default to a strict clock model if a local clock model is specified.

  • ancestral_logger – An AncestralLogger option indicating which class of logger (if any) to use for ancestral states. This option is intended for inputs to BEAST 2.

  • table_type – A TableType option indicating which type of tabular output to generate. Only applicable for tabular outputs. Default value is “matrix”.

  • seed – A seed for random number generation (for setting initial values of unspecified transcriptional rates in BEAST 2 XML output).
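Inferring the format from the file suffix might look like the sketch below. Only the suffixes named elsewhere in this reference (.nex, .nexus, .nxs, .csv, .xlsx) are mapped; the string format names, the case-insensitive match, and the error raised for unknown suffixes are assumptions, and the real method works with the Format enumeration.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical mapping limited to suffixes mentioned in this reference.
SUFFIX_TO_FORMAT = {
    ".nex": "NEXUS",
    ".nexus": "NEXUS",
    ".nxs": "NEXUS",
    ".csv": "CSV",
    ".xlsx": "EXCEL",
}


def infer_format(file_addr: Union[Path, str], format: Optional[str] = None) -> str:
    """Return the explicit format if given, else guess it from the suffix."""
    if format is not None:
        return format
    suffix = Path(file_addr).suffix.lower()  # case-insensitivity is an assumption
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"cannot infer an output format from suffix {suffix!r}")
    return SUFFIX_TO_FORMAT[suffix]


if __name__ == "__main__":
    print(infer_format("out/collation.nxs"))  # NEXUS
    print(infer_format(Path("table.csv")))    # CSV
```

Formats written without an extension, such as STEMMA, cannot be inferred this way and would need the format passed explicitly.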

to_hennig86(file_addr: Union[Path, str], drop_constant: bool = False)

Writes this Collation to a file in Hennig86 format with the given address. Note that because Hennig86 format does not support NEXUS-style ambiguities, such ambiguities will be treated as missing data.

Parameters
  • file_addr – A string representing the path to an output file.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

to_long_table(drop_constant: bool = False)

Returns this Collation in the form of a long table with columns for taxa, characters, reading indices, and reading values. Note that this method treats ambiguous readings as missing data.

Parameters

drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

Returns

A NumPy array with columns for taxa, characters, reading indices, and reading values, and rows for each combination of these values in the matrix. A list of column label strings.

to_nexus(file_addr: Union[Path, str], drop_constant: bool = False, char_state_labels: bool = True, frequency: bool = False, ambiguous_as_missing: bool = False, calibrate_dates: bool = False, mrbayes: bool = False, clock_model: ClockModel = ClockModel.strict)

Writes this Collation to a NEXUS file with the given address.

Parameters
  • file_addr – A string representing the path to an output NEXUS file; the file type should be .nex, .nexus, or .nxs.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • char_state_labels – An optional flag indicating whether or not to include the CharStateLabels block.

  • frequency – An optional flag indicating whether to use the StatesFormat=Frequency setting instead of the StatesFormat=StatesPresent setting (and thus represent all states with frequency vectors rather than symbols). Note that this setting is necessary to make use of certainty degrees assigned to multiple ambiguous states in the collation.

  • ambiguous_as_missing – An optional flag indicating whether to treat all ambiguous states as missing data. If this flag is set, then only base symbols will be generated for the NEXUS file. It is only applied if the frequency option is False.

  • calibrate_dates – An optional flag indicating whether to add an Assumptions block that specifies date distributions for witnesses. This option is intended for inputs to BEAST 2.

  • mrbayes – An optional flag indicating whether to add a MrBayes block that specifies model settings and age calibrations for witnesses. This option is intended for inputs to MrBayes.

  • clock_model – A ClockModel option indicating which type of clock model to use. This option is intended for inputs to MrBayes and BEAST 2. MrBayes does not presently support a local clock model, so it will default to a strict clock model if a local clock model is specified.

to_nexus_table(drop_constant: bool = False, ambiguous_as_missing: bool = False)

Returns this Collation in the form of a table with rows for taxa, columns for characters, and reading IDs in cells.

Parameters
  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • ambiguous_as_missing (bool, optional) – An optional flag indicating whether to treat all ambiguous states as missing data.

Returns

A NumPy array with rows for taxa, columns for characters, and reading IDs in cells. A list of substantive reading ID strings. A list of witness ID strings.

to_numpy(drop_constant: bool = False, split_missing: bool = True)

Returns this Collation in the form of a NumPy array, along with arrays of its row and column labels.

Parameters
  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

  • split_missing – An optional flag indicating whether or not to treat missing characters/variation units as having a contribution of 1 split over all states/readings; if False, then missing data is ignored (i.e., all states are 0). Default value is True.

Returns

A NumPy array with a row for each substantive reading and a column for each witness. A list of substantive reading ID strings. A list of witness ID strings.
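The effect of split_missing on a single witness's column can be sketched as follows, assuming the witness's support is given as one list of coefficients per variation unit and that a unit with all-zero support counts as missing. The helper name `witness_column` is hypothetical.

```python
from typing import List


def witness_column(support_by_unit: List[List[float]], split_missing: bool = True) -> List[float]:
    """Flatten per-unit reading supports into one column (illustrative only)."""
    column: List[float] = []
    for support in support_by_unit:
        if split_missing and support and not any(support):
            # Missing unit: a contribution of 1 split evenly over its readings.
            column.extend([1.0 / len(support)] * len(support))
        else:
            column.extend(float(s) for s in support)
    return column


if __name__ == "__main__":
    supports = [[1, 0], [0, 0, 0, 0]]
    print(witness_column(supports))                       # [1.0, 0.0, 0.25, 0.25, 0.25, 0.25]
    print(witness_column(supports, split_missing=False))  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```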

to_phylip(file_addr: Union[Path, str], drop_constant: bool = False)

Writes this Collation to a file in PHYLIP format with the given address. Note that because PHYLIP format does not support NEXUS-style ambiguities, such ambiguities will be treated as missing data.

Parameters
  • file_addr – A string representing the path to an output file.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

to_stemma(file_addr: Union[Path, str])

Writes this Collation to a STEMMA file without an extension and a Chron file (containing low, middle, and high dates for all witnesses) without an extension.

Since this format does not support ambiguous states, all reading vectors with anything other than one nonzero entry will be interpreted as lacunose.

Parameters
  • file_addr – A string representing the path to an output STEMMA prep file; the file should have no extension. The accompanying chron file will match this file name, except that it will have "_chron" appended to the end.

  • drop_constant (bool, optional) – An optional flag indicating whether to ignore variation units with one substantive reading.

update_origin_date_range_from_witness_date_ranges()

Conditionally updates the upper bound on the date of origin of the work represented by this Collation based on the bounds on the witnesses’ dates. If none of the witnesses have bounds on their dates, then nothing is done. This method is only invoked if the work’s date of origin does not already have its upper bound defined.

update_witness_date_ranges_from_origin_date_range()

Attempts to update the lower bounds on the witnesses’ dates using the upper bound on the date of origin of the work represented by this Collation. This method is only invoked if the upper bound on the work’s date of origin was not already defined (i.e., if update_origin_date_range_from_witness_date_ranges was not invoked).

validate_intrinsic_relations()

Checks if any VariationUnit’s intrinsic_relations map is not a forest. If any is not, then an IntrinsicRelationsException is thrown describing the VariationUnit at fault.
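One way to test the forest property is sketched below. It assumes each key of the relations map is a (first reading, second reading) pair treated as a directed edge, and that "forest" means no reading has more than one incoming edge and no cycle exists. Both points are assumptions for illustration, not a statement of how teiphy performs the check.

```python
from typing import Dict, Iterable, Tuple


def is_forest(relations: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the directed edges form a forest (sketch)."""
    parent: Dict[str, str] = {}
    for source, target in relations:
        if target in parent:
            return False  # a reading with two incoming edges
        parent[target] = source
    for node in parent:
        # Walk up towards the root; revisiting a node means there is a cycle.
        seen = {node}
        current = parent[node]
        while current in parent:
            if current in seen:
                return False
            seen.add(current)
            current = parent[current]
    return True


if __name__ == "__main__":
    print(is_forest([("1", "2"), ("1", "3")]))  # True
    print(is_forest([("1", "3"), ("2", "3")]))  # False
    print(is_forest([("1", "2"), ("2", "1")]))  # False
```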

validate_wits(xml: ElementTree)

Given an XML tree for a collation, checks if any witness sigla listed in a rdg, rdgGrp, or witDetail element, once stripped of ignored suffixes, is not found in the witness list. A warning will be issued for each distinct siglum like this. This method also checks if the upper bound of any witness’s date is earlier than the lower bound on the collated work’s date of origin and throws an exception if so.

Parameters

xml – An lxml.etree.ElementTree representing an XML tree rooted at a TEI element.

Variation Unit

class teiphy.VariationUnit(xml: Element, verbose: bool = False)

Base class for storing TEI XML variation unit data internally.

This corresponds to an app element in the collation.

id

The ID string of this variation unit, which should be unique.

readings

A list of Readings contained in this VariationUnit.

intrinsic_relations

A dictionary mapping pairs of IDs of Readings in this VariationUnit to the intrinsic odds category describing the two readings’ relative probability of being authorial.

transcriptional_relations

A dictionary mapping pairs of IDs of Readings in this VariationUnit to a set of transcriptional change categories that could explain the rise of the second reading from the first.

parse(xml: Element, verbose: bool = False)

Given an XML element, recursively parses its subelements for readings, reading groups, and witness details.

Other children of app elements, such as note, noteGrp, and wit elements, are ignored.

Parameters
  • xml – An lxml.etree.Element representing an app element.

  • verbose – An optional boolean flag indicating whether or not to print status updates.

Witness

class teiphy.Witness(xml: Element, verbose: bool = False)

Base class for storing TEI XML witness data internally.

This corresponds to a witness element in the collation.

id

The ID string of this Witness. It should be unique.

type

A string representing the type of witness. Examples include “corrector”, “version”, and “father”.

date_range

A list containing a low and high date for this Witness.

Reading

class teiphy.Reading(xml: Element, verbose: bool = False)

Base class for storing TEI XML reading data internally.

This can correspond to a lem, rdg, or witDetail element in the collation.

id

The ID string of this reading, which should be unique within its parent app element.

type

A string representing the type of reading. Examples include “reconstructed”, “defective”, “orthographic”, “subreading”, “ambiguous”, “overlap”, and “lac”. The default value is “substantive”.

text

Serialization of the contents of this element.

wits

A list of sigla referring to witnesses that support this reading.

targets

A list of other reading ID strings to which this reading corresponds. For substantive readings, this should be empty. For ambiguous readings, it should contain references to the readings that might correspond to this one. For overlap readings, it should contain a reference to the reading from the overlapping variation unit responsible for the overlap.

certainties

A dictionary mapping target reading IDs to floating-point certainty values.

parse(xml: Element, verbose: bool = False)

Given an XML element, recursively parses it and its subelements.

Parameters
  • xml – A lem, rdg, or witDetail element.

  • verbose – An optional flag indicating whether or not to print status updates.