Extracting parts of the Open Tree of Life

The Perl code below lets you quickly extract subtrees from the draft version of the Open Tree of Life. First find the _ott number for the taxon you want (e.g. Arthropoda_ott632179), which you can do by grepping through the .tre file (grep -o 'Arthropoda[^,]*' draftversion1.tre). Then, to extract the arthropods down to a depth of, say, 10 levels (generating a tree with 90067 tips), run

./subtree_extract.pl -d=10 draftversion3.tre 632179

That will write the subtree to a Newick file called 632179.nwk. You can give the script as many OTT numbers as you like: it will create one Newick file for each. If you omit the -d=N option, it will extract the entire subtree. It’s trivial to run ./subtree_extract.pl draftversion1.tre 632179 and extract the full arthropod subtree in a fraction of a second, but the resulting tree has 1130747 tips, so it tends to kill any software you might use to view it (although I’m impressed that it does actually open in Archaeopteryx).
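Before opening an extracted file in a viewer, it can help to check how many tips it contains. The following is a minimal Python sketch, not part of the Perl script. The function name count_tips is made up for illustration, and it assumes that labels contain no commas or parentheses and that there is no whitespace between delimiters.

```python
def count_tips(newick):
    """Count the tips (leaves) in a Newick string.

    A tip starts right after "(" or "," whenever the next character
    is not another "(". Empty tip names are counted too.
    Assumption: labels contain no commas or parentheses.
    """
    tips = 0
    prev = None
    for ch in newick:
        if prev in ("(", ",") and ch != "(":
            tips += 1
        prev = ch
    return tips


if __name__ == "__main__":
    # Toy tree with four named tips.
    print(count_tips("((a,b)c,(d,e)f)g;"))
```

For a real file you would read the contents of, say, 632179.nwk into a string and pass that to the function.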

Some other useful numbers you could try: 801601 (vertebrates ~ 85,000 spp), 691846 (animals, ~ 1.44 million spp), 99252 (flowering plants ~ 260,000 spp).
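If you would rather look up OTT numbers for other taxa without grep, the same search can be sketched in Python. This is only an illustration: the function names and the toy tree are invented (only Arthropoda_ott632179 comes from the text above), and the pattern stops at brackets and semicolons as well as commas, which is slightly stricter than the grep pattern.

```python
import re


def find_ott_labels(newick_text, taxon):
    """Return every label in the text that starts with the taxon name."""
    # Like grep -o 'Taxon[^,]*', but also stopping at brackets and semicolons.
    pattern = re.compile(re.escape(taxon) + r"[^,();]*")
    return pattern.findall(newick_text)


def ott_id(label):
    """Pull the digits after _ott out of a label, or return None."""
    match = re.search(r"_ott(\d+)", label)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Toy tree: every label except Arthropoda_ott632179 is made up.
    sample = "((a_ott1,b_ott2)Arthropoda_ott632179,c_ott3)root_ott4;"
    for label in find_ott_labels(sample, "Arthropoda"):
        print(label, ott_id(label))
```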

Note that if you extract down to a fixed depth you may be left with some empty taxon names, for example ((a,,b),); where a node at that level is unnamed in the OpenTree file. This is allowed by the Newick format, but some tree libraries don’t like it. I had problems using DendroPy 3 to read the files, but DendroPy 4 is fine. Annoyingly, Dendroscope doesn’t appear to like them either.
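If a library refuses the empty names, one workaround is to fill them in with placeholders before parsing. Here is a rough Python sketch of that idea. The function name and the unnamed_ prefix are arbitrary choices for illustration, it only fills empty tips (not unnamed internal nodes), and it assumes that labels contain no commas or parentheses. I have not checked whether this is enough for every parser.

```python
def name_empty_tips(newick, prefix="unnamed_"):
    """Insert placeholder names wherever a tip label is empty.

    An empty tip is a "," or ")" that directly follows "(" or ",".
    Assumption: labels contain no commas or parentheses.
    """
    out = []
    counter = 0
    prev = None
    for ch in newick:
        if prev in ("(", ",") and ch in (",", ")"):
            counter += 1
            out.append(f"{prefix}{counter}")
        out.append(ch)
        prev = ch
    return "".join(out)


if __name__ == "__main__":
    print(name_empty_tips("((a,,b),);"))
```

Applied to the example above, ((a,,b),); becomes ((a,unnamed_1,b),unnamed_2);.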

subtree_extract.pl (code in the public domain)

#!/usr/bin/perl -sw
use strict;
use vars qw/$d/;
use File::ReadBackwards;
 
# assumes that close braces are not used in taxon names
 
my $OpenToL = shift @ARGV;
tie *TREE, "File::ReadBackwards", $OpenToL, ")" or die "can't read '$OpenToL': $!"; # 'or', not '||', so a failed tie really dies
my $match = join "|", @ARGV;
my %braces = ();
while (<TREE>) { #read in reverse order, separated by close brackets
  if (scalar %braces) {  # currently in the middle of extracting one or more subtrees
    foreach my $k (keys(%braces)) {
       $braces{$k}->{depth}++;
    }
    my $open_brace = ""; # empty on first iteration, the open brace char afterwards
    foreach (reverse split(/\(/)) { #we are going backwards => count # of open braces
      foreach my $k (keys(%braces)) {
        if ($open_brace) {
         $braces{$k}->{depth}--;
        }
        if ($braces{$k}->{depth} == 0) {
          unshift @{$braces{$k}->{lines}}, $open_brace;
          open FH, ">", "$k.nwk" or die "Couldn't open file '$k.nwk' for writing: $!"; # 'or' binds looser than the open arguments
          print(FH) foreach (@{$braces{$k}->{lines}});
          close FH;
          delete $braces{$k};
        } else {
          if (!($d) || ($braces{$k}->{depth} < $d)) {
            unshift @{$braces{$k}->{lines}}, $_.$open_brace;
          } elsif ($braces{$k}->{depth} == $d) {
            unshift @{$braces{$k}->{lines}}, $_;
          }
        }
      }
      $open_brace = "(";
    }
  }
  if (m|^([- \w'/]+_ott($match)'?)\D|) {
    $braces{$2} = {depth=>0, lines=>["$1",";"]};
  }
}
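
The core idea of the script (start at the taxon label and walk backwards, counting brackets until they balance) can be shown on a small in-memory string. This Python sketch is a simplified illustration, not a port of the Perl code: it has no -d depth limit, it does not read the file backwards, the function name is invented, and it assumes the label is unique in the tree and that labels contain no brackets.

```python
import re


def extract_subtree(newick, label):
    """Return the subtree rooted at label as a Newick string, or None."""
    # Require a non-digit after the label so that _ott63 does not match _ott632179.
    match = re.search(re.escape(label) + r"(?!\d)", newick)
    if match is None:
        return None
    end = match.start()
    # An internal node label sits directly after its closing bracket.
    if end == 0 or newick[end - 1] != ")":
        return label + ";"  # the label belongs to a tip
    depth = 0
    for i in range(end - 1, -1, -1):  # walk backwards from the label
        ch = newick[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
            if depth == 0:  # brackets balance: subtree starts here
                return newick[i:end] + label + ";"
    return None  # unbalanced brackets


if __name__ == "__main__":
    # Toy tree with made-up OTT numbers.
    tree = "((a_ott1,b_ott2)c_ott3,(d_ott4,e_ott5)f_ott6)g_ott7;"
    print(extract_subtree(tree, "f_ott6"))
```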

 

13 thoughts on “Extracting parts of the Open Tree of Life”

  1. Hi!

    I am trying to use your script, but following your example of
    $ ./subtree_extract.pl -d=10 draftversion1.tre 632179
    I get about 90,000 lines of
    “Use of uninitialized value $_ in pattern match (m//) at ./subtree_extract.pl line 40.”
    Do you know why that might be?

    Thanks,
    Nik

    • Apologies for the late reply; this was caught in my spam filter. Perl v5.16.2. It’s my fault: the while() should be while(<TREE>). Corrected now.

  2. Hello Yan! Thank you for sharing your work, it’s very useful!
    There is a small typo in the text “It’s trivial to do ./subtree_extract.pl -draftversion1.tre 632179”, with an unnecessary dash before draftversion1.tre. If you paste the command like this, it will not work.

  3. extract_tree.pl draftversion4.tre 99252

    Can’t locate File/ReadBackwards.pm in @INC (you may need to install the File::ReadBackwards module) (@INC contains: /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.2 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at extract_tree.pl line 4.
    BEGIN failed–compilation aborted at extract_tree.pl line 4.

    • ‘you may need to install the File::ReadBackwards module’

      So that’s what you will need to do. Try

      >sudo cpan install File::ReadBackwards

  4. Hi Yan,
    I tried to run the script with the ott_id 304358. It gives me the following error:

    readline() on unopened filehandle TREE at ./subtree_extract.pl line 12.

    I used the following command:
    ./subtree_extract.pl -d=10 draftversion3.tre 304358
    Can you help me find out what the problem is? Thanks.

    • Did you download draftversion3.tre from the Open Tree of Life project? You might as well use draftversion4.tre now anyway. You’ll either need to have the tree in the same dir as the script you are running, or provide the path to it.

      • Hi Yan,
        Thanks. I downloaded the draftversion3.tre.gz file, extracted it in the same directory and it worked. I was wondering if it’s possible to give the URL of the draft tree when I am running the script instead of downloading it first.

        • No, that’s not possible, sorry. The script reads from the end of the file, so it would have to download the entire tree first anyway, in which case you might as well download the file by hand. Also, it would rather negate the point of fast extraction if you had to download the huge file every time you wanted to get a subtree.

  5. Updated for OpenTree v5 (allow spaces and dashes in names). Use opentree5.0_simplified_names.tre or labelled_supertree_simplified_ottnames_with_monotypic.tre
