Phylogenetic trees

PhyloTree

class PhyloTree(newick=None, children=None, alignment=None, alg_format='fasta', sp_naming_function=None, parser=None)[source]

Bases: Tree

Class to store a phylogenetic tree.

Extends the standard Tree instance by adding specific properties and methods to work with phylogenetic trees.

__init__(newick=None, children=None, alignment=None, alg_format='fasta', sp_naming_function=None, parser=None)[source]
Parameters:
  • newick – If not None, initializes the tree from a newick, which can be a string or file object containing it.

  • children – If not None, the children to add to this node.

  • alignment – File containing a multiple sequence alignment.

  • alg_format – “fasta”, “phylip” or “iphylip” (interleaved).

  • parser – Parser to read the newick.

  • sp_naming_function – Function that takes a node name and returns the species name (see PhyloTree.set_species_naming_function()). By default, the first three letters of each node name are used as the species identifier.
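The default behaviour of sp_naming_function described above can be pictured with a plain-Python sketch. The helper names and sample leaf names below are hypothetical and not part of ETE. They only illustrate what such a function receives and returns.

```python
# Illustrative only: mimics the documented default (first three letters of
# the node name). These helper names are hypothetical, not ETE functions.
def first_three_letters(node_name):
    return node_name[:3]


def species_after_underscore(node_name):
    # Alternative convention for names such as "gene1_Hsa".
    return node_name.split("_")[1]


leaf_names = ["Hsa_001", "Ptr_002", "Mmu_001"]
species = [first_three_letters(name) for name in leaf_names]
```

A function of either shape could be passed as sp_naming_function when building a PhyloTree, or set later through set_species_naming_function().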

annotate_gtdb_taxa(taxid_attr='species', tax2name=None, tax2track=None, tax2rank=None, dbfile=None)[source]
annotate_ncbi_taxa(taxid_attr='species', tax2name=None, tax2track=None, tax2rank=None, dbfile=None)[source]

Add NCBI taxonomy annotation to all descendant nodes. Leaf nodes are expected to contain a feature (name, by default) encoding a valid taxid number.

All descendant nodes (including internal nodes) are annotated with the following new features:

Node.spname: scientific species name as encoded in the NCBI taxonomy database

Node.named_lineage: the NCBI lineage track using scientific names

Node.taxid: NCBI taxid number

Node.lineage: same as named_lineage but using taxid codes.

Note that for internal nodes, NCBI information will refer to the first common lineage of the grouped species.

Parameters:
  • taxid_attr (name) – the name of the feature that should be used to access the taxid number associated to each node.

  • tax2name (None) – A dictionary where keys are taxid numbers and values are their translation into NCBI scientific names. It is optional, and supplying it avoids database queries when annotating many trees that contain the same set of taxids.

  • tax2track (None) – A dictionary where keys are taxid numbers and values are their translation into NCBI lineage tracks (taxids). It is optional, and supplying it avoids database queries when annotating many trees that contain the same set of taxids.

  • tax2rank (None) – A dictionary where keys are taxid numbers and values are their translation into NCBI rank names. It is optional, and supplying it avoids database queries when annotating many trees that contain the same set of taxids.

  • dbfile (None) – If provided, this file will be used as a local copy of the NCBI taxonomy database.

Returns:

tax2name (a dictionary translating taxid numbers into scientific name), tax2lineage (a dictionary translating taxid numbers into their corresponding NCBI lineage track) and tax2rank (a dictionary translating taxid numbers into rank names).
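The note above about internal nodes can be read as follows: an internal node receives the deepest taxon shared by the lineage tracks of the species it groups. The sketch below illustrates that reading in plain Python. It does not call ETE, the function name is hypothetical, and the taxid numbers are made-up toy values rather than real NCBI identifiers.

```python
def common_lineage(lineages):
    """Return the longest shared prefix of several root-to-tip lineage tracks."""
    shared = []
    for ranks in zip(*lineages):
        # Stop at the first level where the tracks disagree.
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


# Toy taxid tracks (made-up numbers, not real NCBI identifiers).
lineage_a = [1, 10, 20, 31]
lineage_b = [1, 10, 20, 32]
lineage_c = [1, 10, 25]

shared = common_lineage([lineage_a, lineage_b])
internal_taxid = shared[-1]  # taxon assigned to the node grouping a and b
```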

collapse_lineage_specific_expansions(species=None, return_copy=True)[source]

Converts lineage specific expansion nodes into a single tip node (randomly chosen from tips within the expansion).

Parameters:

species (None) – If supplied, only expansions matching the species criteria will be pruned. When None, all expansions within the tree will be processed.
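As a rough illustration of the idea, the sketch below collapses every subtree whose tips all belong to a single species into one randomly chosen tip. It works on nested tuples rather than PhyloTree objects. All names are hypothetical, species are taken from the first three letters of each tip name, and treating the species filter as a collection of species names is an assumption.

```python
import random


def species_of(leaf_name):
    return leaf_name[:3]


def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def collapse_expansions(tree, rng, species=None):
    """Replace single-species subtrees by one randomly chosen tip."""
    if isinstance(tree, str):
        return tree
    tips = leaves(tree)
    found = {species_of(tip) for tip in tips}
    if len(found) == 1 and (species is None or found <= set(species)):
        return rng.choice(tips)
    return tuple(collapse_expansions(child, rng, species) for child in tree)


tree = ((("Hsa_1", "Hsa_2"), "Ptr_1"), ("Mmu_1", ("Mmu_2", "Mmu_3")))
collapsed = collapse_expansions(tree, random.Random(0))
only_mouse = collapse_expansions(tree, random.Random(0), species=["Mmu"])
```

In this toy tree the two human tips and the three mouse tips each collapse to a single tip, while the chimpanzee tip is left untouched.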

get_age(species2age)[source]

Implements the phylostratigraphic method described in:

Huerta-Cepas, J., & Gabaldon, T. (2011). Assigning duplication events to relative temporal scales in genome-wide studies. Bioinformatics, 27(1), 38-45.
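A minimal sketch of how a species2age dictionary might be used, assuming the age of a node is taken as the oldest age among the species it groups. That rule is an assumption about the method, not a statement of ETE's implementation. The species codes and age values are invented for illustration.

```python
# Hypothetical ages: larger numbers stand for older (more inclusive)
# relationships relative to a seed species.
species2age = {"Hsa": 1, "Ptr": 2, "Mmu": 3, "Dme": 4}


def node_age(species_under_node, species2age):
    """Age of a node = oldest age among the species found under it."""
    return max(species2age[sp] for sp in species_under_node)


age = node_age({"Hsa", "Ptr", "Mmu"}, species2age)
```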

get_age_balanced_outgroup(species2age)[source]

New in version 2.2.

Returns the node that best balances the current tree structure according to the topological age of the different leaves and internal node sizes.

Parameters:

species2age – A dictionary translating from leaf names into a topological age.

get_descendant_evol_events(sos_thr=0.0)[source]

Returns a list of all duplication and speciation events detected after this node. A node is assumed to be a duplication when species overlap is found between its child lineages. The method is described in more detail in:

“The Human Phylome.” Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldon T. Genome Biol. 2007;8(6):R109.
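The species overlap rule can be sketched in plain Python. The version below assumes the overlap score is the number of shared species divided by the number of species in either child lineage, and that a node is called a duplication when the score exceeds sos_thr. Both the score definition and the function names are assumptions made for illustration.

```python
def species_overlap_score(species_a, species_b):
    """Fraction of species present on both sides of a node (sets must be non-empty)."""
    shared = species_a & species_b
    union = species_a | species_b
    return len(shared) / len(union)


def classify_node(species_a, species_b, sos_thr=0.0):
    """Return "D" (duplication) or "S" (speciation) for a bifurcating node."""
    score = species_overlap_score(species_a, species_b)
    return "D" if score > sos_thr else "S"


left = {"Hsa", "Ptr"}
right = {"Hsa", "Mmu"}
default_call = classify_node(left, right)            # one shared species
strict_call = classify_node(left, right, sos_thr=0.5)
```

With the default threshold of 0.0, any shared species is enough to label the node as a duplication. Raising sos_thr makes the call more conservative.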

get_farthest_oldest_leaf(species2age, is_leaf_fn=None)[source]

Return the farthest oldest leaf to the current one.

It requires a species2age dictionary with the age estimation for all species.

Parameters:

is_leaf_fn (None) – A pointer to a function that receives a node instance as unique argument and returns True or False. It can be used to dynamically collapse nodes, so they are seen as leaves.

get_farthest_oldest_node(species2age)[source]

Return the farthest oldest node (leaf or internal).

The difference with get_farthest_oldest_leaf() is that in this function internal nodes grouping seqs from the same species are collapsed.

get_my_evol_events(sos_thr=0.0)[source]

Return list of duplication and speciation events involving this node.

Scanned nodes are also labeled internally as dup=True|False. You can access these labels using node.dup.

The algorithm scans all nodes from this node to the root. A node is assumed to be a duplication when species overlap is found between its child lineages. The method is described in more detail in:

“The Human Phylome.” Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldon T. Genome Biol. 2007;8(6):R109.

get_speciation_trees(map_properties=None, autodetect_duplications=True, newick_only=False, prop='species')[source]

Return the number of species trees, the number of duplications, and an iterator over the species trees.

Calculates all possible species trees contained within a duplicated gene family tree, as described in TreeKO (see Marcet and Gabaldon, 2011).

Parameters:
  • map_properties – List of properties that should be mapped from the original gene family tree to each species tree subtree.

  • autodetect_duplications – If True, duplication nodes will be automatically detected using the Species Overlap algorithm (PhyloTree.get_descendant_evol_events()). If False, duplication nodes within the original tree are expected to contain the property “evoltype=’D’”.
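The idea of enumerating the species trees contained in a gene family tree can be sketched as follows: at a duplication node each child offers alternative subtrees, while at a speciation node one alternative from every child is combined. This is a conceptual illustration on nested tuples, not TreeKO's or ETE's implementation. The node encoding `(etype, children)` and the function name are hypothetical.

```python
from itertools import product


def speciation_trees(node):
    """Yield every duplication-free subtree implied by a labelled gene tree."""
    if isinstance(node, str):  # a leaf is its own only alternative
        yield node
        return
    etype, children = node
    if etype == "D":
        # A duplication contributes the alternatives found under each child.
        for child in children:
            yield from speciation_trees(child)
    else:
        # A speciation combines one alternative from every child.
        options = [list(speciation_trees(child)) for child in children]
        for combo in product(*options):
            yield combo


gene_tree = (
    "S",
    [
        ("D", [("S", ["Hsa_1", "Ptr_1"]), ("S", ["Hsa_2", "Ptr_2"])]),
        "Mmu_1",
    ],
)
trees = list(speciation_trees(gene_tree))
```

The single duplication in this toy tree yields two duplication-free trees, one per paralogous clade, each joined to the mouse outgroup.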

get_species()[source]

Returns the set of species covered by its partition.

iter_species()[source]

Returns an iterator over the species grouped by this node.

ncbi_compare(autodetect_duplications=True, cached_content=None)[source]
reconcile(species_tree)[source]

Returns the reconciled topology with the provided species tree, and a list of evolutionary events inferred from that reconciliation.

set_species_naming_function(fn)[source]

Set the function used to get the species from the node’s name.

Parameters:

fn – Function that takes a nodename and returns the species name.

Example of a parsing function:

def parse_sp_name(node_name):
    return node_name.split("_")[1]
tree.set_species_naming_function(parse_sp_name)

property species
split_by_dups(autodetect_duplications=True)[source]

Return the list of subtrees when splitting by its duplication nodes.

Parameters:

autodetect_duplications (True) – If True, duplication nodes will be automatically detected using the Species Overlap algorithm (PhyloTree.get_descendant_evol_events()). If False, duplication nodes within the original tree are expected to contain the feature “evoltype=D”.

write(self, outfile=None, props=(), parser=None, format_root_node=False, is_leaf_fn=None)[source]

Return or write to file the newick representation.

Parameters:
  • outfile (str) – Name of the output file. If present, the newick is written to that file instead of being returned as a string.

  • props (list) – Properties to write for all nodes using the Extended Newick Format. If None, write all available properties.

  • parser – Parser used to encode the tree in newick format.

  • format_root_node (bool) – If True, write content of the root node too. For compatibility reasons, this is False by default.

Example:

t.write(props=['species', 'sci_name'])

EvolEvent

class EvolEvent[source]

Basic evolutionary event. It stores all the information about an event (node) that occurred in a phylogenetic tree.

etype : D (Duplication), S (Speciation), L (gene loss)

in_seqs : the list of sequences on one side of the event

out_seqs : the list of sequences on the other side of the event

node : link to the event node in the tree
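The attributes listed above can be pictured as a small record type. The dataclass below is a hypothetical stand-in written for illustration. It is not the library's EvolEvent class, and the sequence names are invented.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class EvolEventSketch:
    """Hypothetical stand-in mirroring the documented EvolEvent attributes."""
    etype: str                                         # "D", "S" or "L"
    in_seqs: List[str] = field(default_factory=list)   # one side of the event
    out_seqs: List[str] = field(default_factory=list)  # the other side
    node: Optional[Any] = None                         # link to the tree node


events = [
    EvolEventSketch("D", in_seqs=["Hsa_1", "Ptr_1"], out_seqs=["Hsa_2"]),
    EvolEventSketch("S", in_seqs=["Hsa_1"], out_seqs=["Ptr_1"]),
]
duplications = [ev for ev in events if ev.etype == "D"]
```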