Pruning the tree of life

The Open Tree of Life is huge, with 2543160 named nodes. There are good reasons why you might want to slim it down by pruning or simplifying branches. For example, I find it useful to have a tree of everything resolved to the species level, i.e. without subspecies, varieties, etc. The Perl code below creates such a tree, and it is easily modified to produce a genus- or family-level tree instead. It takes about 25 seconds on my (low-powered) MacBook Air, most of which is spent reading the 317 MB taxonomy.tsv file. You can call it via

./subspecies_delete.pl OT_draftversion2.tre taxonomy.tsv > nosubsp.tre

For the current draft tree (version 2), it identifies the following problematic taxa:

Species Ascochyta_fabae_ott1084624 is nested within another species
Species Fusarium_oxysporum_f_sp_lycopersici_ott810228 is nested within another species
Species Hieracium_linahamariense_ott3897051 is nested within another species
Species Centaurea_subjacea_ott3894019 is nested within another species 
Species Aegilops_triuncialis_ott608778 is nested within another species
Species Aegilops_crassa_ott267029 is nested within another species
Species Aegilops_tauschii_ott881533 is nested within another species
Species Aegilops_longissima_ott267020 is nested within another species
Species Aegilops_peregrina_ott34479 is nested within another species
Species Aegilops_comosa_ott267017 is nested within another species

Removing subspecies etc. doesn’t slim the tree much, though: the trimmed version still has 2498945 nodes, which is only about 2% smaller.
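If you want to check figures like these yourself, here is a minimal sketch. It assumes that a "named node" can be counted as one occurrence of an `_ott<digits>` label in the Newick text, which may not be exactly how the counts above were obtained. The helper names `count_ott_labels` and `percent_smaller` and the toy tree are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count OTT-labelled nodes in a Newick string by counting "_ott<digits>" labels.
# Assumption: every named node carries exactly one such label.
sub count_ott_labels {
    my ($newick) = @_;
    my $count = () = $newick =~ /_ott\d+/g;
    return $count;
}

# Percentage by which a tree shrinks when going from $before to $after nodes.
sub percent_smaller {
    my ($before, $after) = @_;
    return 100 * ($before - $after) / $before;
}

# Toy tree (hypothetical names and ids), not taken from the real Open Tree.
my $toy = "((A_a_x_ott4,A_a_y_ott5)A_a_ott2,B_b_ott3)A_ott1;";
printf "%d labelled nodes in the toy tree\n", count_ott_labels($toy);

# The node counts quoted in the post.
printf "%.1f%% smaller\n", percent_smaller(2543160, 2498945);
```

For a real tree file you would read the file contents into the string first (or accumulate the count chunk by chunk) rather than use a literal.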

subspecies_delete.pl (code in the public domain)

#!/usr/bin/perl -sw
use strict;
use File::ReadBackwards;
 
my $tree = shift @ARGV; # first arg is location of tree
my $OpenToLTaxonomy = shift @ARGV; # second arg is the corresponding taxonomy.tsv file
open(TAXONOMY, "<", $OpenToLTaxonomy) 
  or die "Cannot open taxonomy file $OpenToLTaxonomy: $!";
 
my @header = split("\t", <TAXONOMY>);
my( $rank_index )= grep { $header[$_] eq "rank" } 0..$#header;
my( $OTTid_index )= grep { $header[$_] eq "uid" } 0..$#header;
my %species;
 
while (<TAXONOMY>) {
  if ((split("\t"))[$rank_index] eq "species") {
    $species{(split("\t"))[$OTTid_index]} = 1;
  }
}
close(TAXONOMY);
tie *BACKTREE, "File::ReadBackwards", $tree, ")"
  or die "can't read newick file '$tree' $!" ;
my $del_depth=0;
my $line = 0;
my @omit;
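while (<BACKTREE>) { #read in reverse order, in chunks that each end at a close bracket
  #a chunk that starts with a name labels the bracket closed just before it in the
  #file, i.e. the next chunk we read; a leading comma means that bracket is unnamed
  my ($uid) = (/^[^,].*?_ott(\d+)[',;\)]/);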
  if ($del_depth) {
    #if we have started deleting, we keep track of which are deleted 
    my $pos = length;
    do {
      $pos = rindex($_,'(',$pos-1);
    } while ($pos != -1 && --$del_depth);
    unshift @omit, [$line, $pos]; #$pos == -1 means omit all this line
    $del_depth++ if ($del_depth); #if still in brace nest, next loop increases depth
  };
  if (defined $uid && exists $species{$uid}) {
    if ($del_depth) {
      my ($name) = (/^(.+?_ott\d+)[',;\)]/);
      warn("Species $name is nested within another species: ignoring this species");
    } else {
      $del_depth = 1;
    }
  }
  $line++;
}
close(BACKTREE);
 
#recalculate to count close braces from start of file
foreach (@omit) {
  $_->[0] = $line - $_->[0];
}
 
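#now go forwards through the file, printing when @omit allows
$/ = ")"; #read forwards in the same close-bracket-delimited chunks
open(FORETREE, "<", $tree) or die "cannot open $tree: $!";
while(<FORETREE>) {
  if (@omit && ($. == $omit[0][0])) {
    #print only the text before the opening bracket of the deleted clade
    print substr($_, 0, $omit[0][1]) if ($omit[0][1] != -1);
    shift @omit;
  } else {
    print;
  }
}
close(FORETREE);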
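As a rough sketch of the genus- or family-level modification mentioned at the top, the taxonomy-reading step could be wrapped in a helper that takes the rank as a parameter. The name `uids_with_rank` and the toy taxonomy rows are hypothetical, and the toy data assumes a tab-pipe-tab column layout. The sketch has not been run against the real taxonomy.tsv or tree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref of uids whose rank is one of @ranks.
# Columns are located by header name, as in the script above.
sub uids_with_rank {
    my ($fh, @ranks) = @_;
    my %wanted = map { $_ => 1 } @ranks;
    my @header = split /\t/, scalar <$fh>;
    my ($rank_index) = grep { $header[$_] eq "rank" } 0 .. $#header;
    my ($uid_index)  = grep { $header[$_] eq "uid" }  0 .. $#header;
    die "no 'rank' or 'uid' column in header\n"
        unless defined $rank_index && defined $uid_index;
    my %uids;
    while (my $row = <$fh>) {
        my @fields = split /\t/, $row;
        next unless defined $fields[$rank_index];   # skip short rows
        $uids{ $fields[$uid_index] } = 1 if $wanted{ $fields[$rank_index] };
    }
    return \%uids;
}

# Toy taxonomy (made-up rows) read from an in-memory filehandle.
my $toy = join "", map { join("\t|\t", @$_) . "\t|\n" } (
    [qw(uid parent_uid name rank)],
    [qw(1 0 Aus genus)],
    [qw(2 1 Aus_bus species)],
    [qw(3 2 Aus_bus_cus subspecies)],
);
open(my $fh, "<", \$toy) or die "cannot open in-memory file: $!";
my $genera = uids_with_rank($fh, "genus");
close($fh);
print join(",", sort keys %$genera), "\n";
```

In the script, the resulting hash would stand in for %species, so that everything below a node of the chosen rank is deleted. The "nested within another species" warning text would need changing to match.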

2 thoughts on “Pruning the tree of life”

  1. I am trying to prune the tree for certain plant and animal families. I am far from an expert in Perl, and I am not entirely sure where in the code I can prune for the families I need. Any help would be welcome.

    • It sounds like you are trying to extract subtrees from the open tree, rather than prune tips. So you probably want my script at http://yanwong.me/?page_id=1090. You first need to find the ‘Open Tree Taxonomy ID’ for those families, which is a number, and simply pass that to the script. You can find this number by searching on the OpenTree website: e.g. for Brassicaceae you should be directed to https://tree.opentreeoflife.org/opentree/argus/ottol@309271/Brassicaceae from where you can find the ott id 309271. You can then call my subtree extraction script as

      ./subtree_extract.pl draftversion4.tre 309271

      Note that for version 5 of the open tree, you’ll need to use the file

      labelled_supertree_simplified_ottnames_with_monotypic.tre

      or modify my script somehow.
