Now that a few draft versions of the Open Tree of Life have been made available, I’ve been playing with them. But since the trees have ~2.5 million leaves, that’s not always an easy task. The Perl scripts below are examples of code that can cope with trees of this size. The first allows quick extraction of a set of subtrees; the second allows fast deletion of a set of subtrees while retaining most of the main tree.
I’ve come to the conclusion that the Newick file format is not ideal for huge trees like this, because the name of a node is placed after the details of the node. For instance, for the node named “Pan”, the format is
((Pan_paniscus,Pan_troglodytes)Pan). That means all the node details need to be read (and maybe parsed) before you get to the name – which can be a pain for huge nodes.
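To make the ordering problem concrete, here is a minimal sketch (not one of the scripts from this post) that walks a Newick string from right to left, so that each internal node's label is seen before any of its children. The helper name `internal_nodes_backwards` is hypothetical, and the sketch assumes plain labels with no quoted names or bracketed comments; branch lengths, if present, would simply be kept as part of the label.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a Newick string from the end towards the start
# and report each internal node's label *before* its children are visited.
# Assumes no quoted labels and no [comments] in the string.
sub internal_nodes_backwards {
    my ($newick) = @_;
    my @found;        # [label, depth] pairs, in the order encountered
    my $depth = 0;    # how many ')' we are currently inside
    my $label = '';   # characters collected since the last delimiter

    for (my $i = length($newick) - 1; $i >= 0; $i--) {
        my $ch = substr($newick, $i, 1);
        if ($ch eq ')') {
            # Whatever we collected is the label written after this ')'.
            push @found, [$label, $depth];
            $depth++;
            $label = '';
        } elsif ($ch eq '(') {
            $depth--;
            $label = '';
        } elsif ($ch eq ',' || $ch eq ';') {
            $label = '';             # a leaf name or the terminator: discard
        } else {
            $label = $ch . $label;   # prepend, because we are moving leftwards
        }
    }
    return @found;
}

# Usage: the example string from the text above.
for my $node (internal_nodes_backwards('((Pan_paniscus,Pan_troglodytes)Pan)')) {
    my ($name, $depth) = @$node;
    print "depth $depth: '$name'\n";
}
```

Scanning in this direction, the label Pan is reported as soon as its closing parenthesis is reached, before either species name has been read. A forward scan would have to get through the whole subtree first.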
For this reason the Perl code listed below tends to use the File::ReadBackwards module, breaking the tree down at each closing parenthesis, which allows a node’s name to be read before its contents are parsed. But it would have been much easier if the Newick restaurant brigade had adopted a format like