Taxonomy databases

NCBITaxa

class NCBITaxa(dbfile=None, taxdump_file=None, memory=False, update=True)[source]

A local transparent connector to the NCBI taxonomy database.

__init__(dbfile=None, taxdump_file=None, memory=False, update=True)[source]

Open and keep a connection to the NCBI taxonomy database.

If the database is not present on the system, it is first downloaded from the NCBI site and converted to ete’s format.

annotate_tree(t, taxid_attr='name', tax2name=None, tax2track=None, tax2rank=None, ignore_unclassified=False)[source]

Annotate a tree containing taxids as leaf names.

The annotation adds the properties: ‘taxid’, ‘sci_name’, ‘lineage’, ‘named_lineage’ and ‘rank’.

Parameters:
  • t – a Tree (or Tree derived) instance.

  • taxid_attr (name) – Custom node attribute containing the taxid number associated with each node (e.g. species in PhyloTree instances).

  • tax2name, tax2track, tax2rank – Pre-calculated dictionaries translating taxid numbers to names, lineage tracks and ranks, respectively.
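
As a rough illustration of the shape these pre-calculated dictionaries might take, the sketch below annotates a plain dict standing in for a tree node. The taxids, the names, and the annotate helper are hypothetical, and the "Unknown" fallback is an assumption; this is not ete's implementation.

```python
# Toy, hand-written lookup tables keyed by taxid (all values are hypothetical).
tax2name = {1: "root", 10: "Mammalia", 101: "Homo sapiens", 102: "Mus musculus"}
tax2track = {101: [1, 10, 101], 102: [1, 10, 102]}
tax2rank = {1: "no rank", 10: "class", 101: "species", 102: "species"}


def annotate(node, taxid_attr="name"):
    """Add the five annotation properties to a plain-dict node (not an ete Tree)."""
    taxid = int(node[taxid_attr])
    track = tax2track.get(taxid, [])
    node.update(
        taxid=taxid,
        sci_name=tax2name.get(taxid, ""),
        lineage=track,
        named_lineage=[tax2name.get(t, str(t)) for t in track],
        rank=tax2rank.get(taxid, "Unknown"),  # fallback value is an assumption
    )
    return node


leaf = annotate({"name": "101"})
print(leaf["sci_name"], leaf["named_lineage"], leaf["rank"])
```

Passing such dictionaries in presumably spares the method from querying the database again for taxids that were already translated.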

get_broken_branches(t, taxa_lineages, n2content=None)[source]

Returns a list of NCBI lineage names that are not monophyletic in the provided tree, as well as the list of affected branches and their size.

CURRENTLY EXPERIMENTAL
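
One way to picture the monophyly test is the toy check below: a lineage name counts as monophyletic when the set of leaves carrying it equals the leaf set of some node in the tree. The nested-tuple tree, the leaf labels, and both helper functions are hypothetical; the real method also reports the affected branches and their sizes, which this sketch omits.

```python
def leaf_sets(tree, out):
    """Append the leaf set under every node of a nested-tuple tree to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def non_monophyletic(tree, taxa_lineages):
    """Return lineage names whose leaves do not form a single clade."""
    out = []
    leaf_sets(tree, out)
    clades = set(out)

    # Group the leaves by every name appearing in their lineage.
    members = {}
    for leaf, lineage in taxa_lineages.items():
        for name in lineage:
            members.setdefault(name, set()).add(leaf)

    return sorted(n for n, leaves in members.items()
                  if frozenset(leaves) not in clades)


# Hypothetical tree in which each primate is paired with a rodent.
tree = (("h1", "m1"), ("h2", "m2"))
taxa_lineages = {
    "h1": ["Mammalia", "Primates"], "h2": ["Mammalia", "Primates"],
    "m1": ["Mammalia", "Rodentia"], "m2": ["Mammalia", "Rodentia"],
}
broken = non_monophyletic(tree, taxa_lineages)
print(broken)
```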

get_common_names(taxids)[source]

get_descendant_taxa(parent, intermediate_nodes=False, rank_limit=None, collapse_subspecies=False, return_tree=False)[source]

Return list of descendant taxids of the given parent.

Parent can be given as taxid or scientific species name.

If intermediate_nodes=True, the list will also have the internal nodes.
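
The effect of intermediate_nodes can be sketched on a toy child-to-parent table. The table, its taxids, and the descendant_taxa helper are hypothetical stand-ins; the real method reads the local database and supports more options.

```python
# Hypothetical child -> parent table (toy taxids, not real NCBI ids).
PARENT = {10: 1, 20: 10, 21: 10, 200: 20, 201: 20, 210: 21}


def descendant_taxa(parent, intermediate_nodes=False):
    """Collect the descendants of `parent`; leaves only unless intermediate_nodes."""
    children = {}
    for child, par in PARENT.items():
        children.setdefault(par, []).append(child)

    found = []
    stack = list(children.get(parent, []))
    while stack:
        taxid = stack.pop()
        kids = children.get(taxid, [])
        if intermediate_nodes or not kids:
            found.append(taxid)
        stack.extend(kids)
    return sorted(found)


print(descendant_taxa(10))
print(descendant_taxa(10, intermediate_nodes=True))
```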

get_fuzzy_name_translation(name, sim=0.9)[source]

Return taxid, species name and match score from the NCBI database.

The results are for the best match for name in the NCBI database of taxa names, with a word similarity >= sim.

Parameters:
  • name – Species name (does not need to be exact).

  • sim (0.9) – Min word similarity to report a match (from 0 to 1).
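
The thresholded best-match idea can be sketched with the standard library. Here difflib's similarity ratio stands in for whatever metric the database uses, the name table is hypothetical, and returning a triple of empty values on no match is an assumption of this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical name -> taxid table (toy ids).
NAMES = {"Homo sapiens": 101, "Mus musculus": 102, "Pan troglodytes": 103}


def fuzzy_name_translation(name, sim=0.9):
    """Return (taxid, matched_name, score) for the best match scoring >= sim."""
    best_name, best_score = None, 0.0
    for candidate in NAMES:
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_name is None or best_score < sim:
        return None, None, 0.0  # no candidate is similar enough
    return NAMES[best_name], best_name, best_score


taxid, matched, score = fuzzy_name_translation("Homo sapien")
print(taxid, matched, round(score, 3))
```

Lowering sim accepts looser matches at the cost of more false positives.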

get_lineage(taxid)[source]

Return lineage track corresponding to the given taxid.

The lineage track is a hierarchically sorted list of parent taxids.
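
A lineage track can be pictured as a walk up a child-to-parent table until the root is reached. The table below is hypothetical, and the root-first ordering that ends with the query taxid itself is an assumption of this sketch.

```python
# Hypothetical child -> parent table; the root (1) is its own parent.
PARENT = {1: 1, 10: 1, 20: 10, 200: 20, 201: 20}


def lineage(taxid):
    """Return the track from the root down to `taxid`."""
    track = [taxid]
    while PARENT[track[-1]] != track[-1]:
        track.append(PARENT[track[-1]])
    return track[::-1]


def lineage_translator(taxids):
    """Return {taxid: lineage track} for several taxids at once."""
    return {taxid: lineage(taxid) for taxid in taxids}


print(lineage(200))
print(lineage_translator([200, 201]))
```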

get_lineage_translator(taxids)[source]

Return dict with lineage tracks corresponding to the given taxids.

The lineage tracks are a hierarchically sorted list of parent taxids.

get_name_translator(names)[source]

Return dict with taxids corresponding to the given scientific names.

Exact name match is required for translation.

get_rank(taxids)[source]

Return dict with the NCBI taxonomy rank of each taxid in the given list.

get_taxid_translator(taxids, try_synonyms=True)[source]

Return dict with the scientific names corresponding to the taxids.

get_topology(taxids, intermediate_nodes=False, rank_limit=None, collapse_subspecies=False, annotate=True)[source]

Return the minimal pruned NCBI taxonomy tree containing taxids.

Parameters:
  • intermediate_nodes (False) – If True, single child nodes representing the complete lineage of leaf nodes are kept. Otherwise, the tree is pruned to contain the first common ancestor of each group.

  • rank_limit (None) – If a valid NCBI rank name is provided, the tree is pruned at that level. For instance, use rank_limit="species" to get rid of sub-species or strain leaf nodes.

  • collapse_subspecies (False) – If True, any item under the species rank will be collapsed into the species upper node.
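
The difference made by intermediate_nodes can be sketched on a toy taxonomy: the lineages of the requested taxids are merged, and chains of single-child nodes are then skipped unless the full lineage is wanted. The parent table, the nested (node, children) tuples, and the topology helper are hypothetical; rank_limit and collapse_subspecies are not modelled.

```python
# Hypothetical child -> parent table (toy taxids); 1 is the root.
PARENT = {10: 1, 20: 10, 21: 10, 200: 20, 201: 20, 210: 21}
ROOT = 1


def lineage(taxid):
    """Return the track from the root down to `taxid`."""
    track = [taxid]
    while track[-1] in PARENT:
        track.append(PARENT[track[-1]])
    return track[::-1]


def topology(taxids, intermediate_nodes=False):
    """Return the pruned tree spanning `taxids` as (node, [subtrees])."""
    wanted = set(taxids)
    children = {}
    for taxid in taxids:
        track = lineage(taxid)
        for par, child in zip(track, track[1:]):
            children.setdefault(par, set()).add(child)

    def build(node):
        kids = sorted(children.get(node, ()))
        # Skip single-child nodes that were not requested explicitly.
        while not intermediate_nodes and len(kids) == 1 and node not in wanted:
            node = kids[0]
            kids = sorted(children.get(node, ()))
        return (node, [build(kid) for kid in kids])

    return build(ROOT)


print(topology([200, 201, 210]))
print(topology([200, 201, 210], intermediate_nodes=True))
```

In the pruned form the tree is rooted at 10, the first common ancestor, and the single-child node 21 disappears so that 210 hangs directly from 10.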

translate_to_names(taxids)[source]

Return list of scientific names corresponding to taxids.

update_taxonomy_database(taxdump_file=None)[source]

Update the NCBI taxonomy database.

It does so by downloading and parsing the latest taxdump.tar.gz file from the NCBI site.

Parameters:
  • taxdump_file – Alternative location of the taxdump.tar.gz file.
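
For orientation, the taxdump archive contains a nodes.dmp file whose rows list the taxid, the parent taxid and the rank, with fields delimited by a tab, a vertical bar and a tab. The sketch below parses a few abridged, hypothetical rows into lookup tables. It only illustrates the file layout; the conversion to ete's own format is not shown, and parse_nodes is a hypothetical helper.

```python
# Abridged, hypothetical rows in nodes.dmp layout (real rows carry more fields).
SAMPLE_NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tclass\t|\n"
    "101\t|\t10\t|\tspecies\t|\n"
)


def parse_nodes(text):
    """Return (parent, rank) dictionaries keyed by taxid."""
    parent, rank = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        taxid, parent_taxid, rank_name = fields[:3]
        parent[int(taxid)] = int(parent_taxid)
        rank[int(taxid)] = rank_name
    return parent, rank


parent, rank = parse_nodes(SAMPLE_NODES_DMP)
print(parent)
print(rank)
```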

GTDBTaxa

class GTDBTaxa(dbfile=None, taxdump_file=None, memory=False)[source]

Local transparent connector to the GTDB taxonomy database.

__init__(dbfile=None, taxdump_file=None, memory=False)[source]

annotate_tree(t, taxid_attr='name', tax2name=None, tax2track=None, tax2rank=None, ignore_unclassified=False)[source]

Annotate a tree containing taxids as leaf names.

It annotates by adding the properties ‘taxid’, ‘sci_name’, ‘lineage’, ‘named_lineage’ and ‘rank’.

Parameters:
  • t – Tree to annotate.

  • taxid_attr – Node attribute (property) containing the taxid associated with each node (e.g. species in PhyloTree instances).

  • tax2name, tax2track, tax2rank – Pre-calculated dictionaries translating taxids to names, lineage tracks and ranks, respectively.

get_broken_branches(t, taxa_lineages, n2content=None)[source]

Returns a list of GTDB lineage names that are not monophyletic in the provided tree, as well as the list of affected branches and their size.

CURRENTLY EXPERIMENTAL

get_common_names(taxids)[source]

get_descendant_taxa(parent, intermediate_nodes=False, rank_limit=None, collapse_subspecies=False, return_tree=False)[source]

Given a parent taxid or scientific species name, return a list of all its descendant taxids. If intermediate_nodes is set to True, internal nodes are also included.

get_name_lineage(taxnames)[source]

Given a valid taxname, return its corresponding lineage track as a hierarchically sorted list of parent taxnames.

get_rank(taxids)[source]

Given a list of GTDB string taxids, return a dictionary with their corresponding ranks. Example:

> gtdb.get_rank(['c__Thorarchaeia', 'RS_GCF_001477695.1'])
{'c__Thorarchaeia': 'class', 'RS_GCF_001477695.1': 'subspecies'}
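
The example suggests how GTDB string taxids encode their rank: names carry a one-letter prefix such as c__, while genome accessions are reported as subspecies. The sketch below guesses a rank from the prefix alone. The prefix table and the guess_rank helper are hypothetical, the real method presumably looks ranks up in the database, and the rank name reported for d__ may differ.

```python
# GTDB-style rank prefixes; the name used for "d__" is an assumption.
PREFIX_TO_RANK = {
    "d__": "domain",
    "p__": "phylum",
    "c__": "class",
    "o__": "order",
    "f__": "family",
    "g__": "genus",
    "s__": "species",
}


def guess_rank(taxid):
    """Guess the rank of a GTDB string taxid from its prefix."""
    # Anything without a known prefix is treated as a genome accession,
    # which the documented example reports as 'subspecies'.
    return PREFIX_TO_RANK.get(taxid[:3], "subspecies")


ranks = {t: guess_rank(t) for t in ["c__Thorarchaeia", "RS_GCF_001477695.1"]}
print(ranks)
```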

get_topology(taxnames, intermediate_nodes=False, rank_limit=None, collapse_subspecies=False, annotate=True)[source]

Return minimal pruned GTDB taxonomy tree containing all given taxids.

Parameters:
  • intermediate_nodes – If True, single child nodes representing the complete lineage of leaf nodes are kept. Otherwise, the tree is pruned to contain the first common ancestor of each group.

  • rank_limit – If a valid rank name is provided, the tree is pruned at that level. For instance, use rank_limit="species" to get rid of sub-species or strain leaf nodes.

  • collapse_subspecies – If True, any item under the species rank will be collapsed into the species upper node.

update_taxonomy_database(taxdump_file=None)[source]

Update the GTDB taxonomy database.

It does so by downloading and parsing the latest gtdbtaxdump.tar.gz file.

Parameters:
  • taxdump_file – Alternative location of gtdbtaxdump.tar.gz.