Mapping OpenTree names to Encyclopedia of Life IDs

Below is a perl script to convert OpenTree ott IDs to Encyclopedia of Life page IDs. This allows access to common names for species, exemplar images, distribution maps, and so on. It doesn’t use the scientific names for species (which can have problems with synonyms, misspellings, and animal/plant duplicates), but instead takes the GBIF, NCBI, WoRMS, Index Fungorum, and IRMNG numbers from the OpenTree taxonomy.tsv file (version ≥ 2.9: thanks Jonathan Rees!), and looks them up using the appropriate EoL API.
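As a rough illustration of that first step, here is a minimal sketch of pulling provider numbers out of one taxonomy.tsv line. It assumes, as the script below does, that fields are separated by tab-pipe-tab and that the fifth field holds comma-separated entries such as irmng:1467136. The helper name provider_ids and the sample line are made up for this example.

```perl
use strict;
use warnings;

# Providers that can be looked up on EoL (same short names as in the script).
my @providers = qw(irmng if gbif worms ncbi);
my $match = join("|", @providers);

# Return a hash ref mapping provider name => array ref of IDs found in
# one line of taxonomy.tsv. Hypothetical helper, for illustration only.
sub provider_ids {
    my ($line) = @_;
    my @fields = split /\t\|\t/, $line;
    my %ids;
    return \%ids unless defined $fields[4];
    while ($fields[4] =~ /\b($match):(\d+)/g) {
        push @{ $ids{$1} }, $2;    # a provider may appear more than once
    }
    return \%ids;
}

# A made-up line, not copied from a real taxonomy.tsv.
my $line = join("\t|\t", 12345, 678, "Examplea fictitia", "species",
                "ncbi:111,gbif:222,irmng:1467136,irmng:1288688", "", "");
my $ids = provider_ids($line);
print "$_: @{ $ids->{$_} }\n" for sort keys %$ids;
```

The word boundary in the pattern matters: without it, the "if" (Index Fungorum) alternative could match inside "gbif".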

Since each taxon can result in several API calls, please don’t bombard the EoL API with the entire 2.5 million ott IDs from the full tree, even though the script does make that possible. If you want to map all ott IDs to EoL pages, someone from the Encyclopedia of Life tells me that they are soon going to publish a table mapping all their provider IDs to EoL page IDs: wait for that before converting the full OpenTree.

As well as the taxonomy.tsv file, you need to provide a file which contains strings of the form _ottXXXXXX, giving the ott numbers to look up. This could even be the OpenTree .tre file: if so, to avoid millions of lookups, you should provide a third argument giving the number of taxa to pick at random from this file. So for example, the following command looks up 100 randomly chosen taxa from the OpenTree.

./ott2eol.pl ott/taxonomy.tsv OT_draftversion2.tre 100

Or, for repeatability, you can do the same but with a specific random seed (e.g. 12345):

./ott2eol.pl ott/taxonomy.tsv OT_draftversion2.tre 100 12345
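To see why a seed gives repeatability, here is a small sketch of the one-pass sampling rule the script uses, wrapped in a hypothetical function called reservoir_sample. The list of numbers 1 to 1000 is just a stand-in for ott IDs. With the same seed, the same run of random numbers is drawn, so the same taxa are picked.

```perl
use strict;
use warnings;

# Pick $n items in a single pass: keep the first $n, then for the $i-th
# item replace a random slot with probability $n/$i. $n == 0 keeps everything.
sub reservoir_sample {
    my ($n, $seed, @items) = @_;
    srand($seed) if defined $seed;
    my @sample;
    my $i = 0;
    for my $item (@items) {
        $i++;
        if ($n == 0 || @sample < $n) {
            push @sample, $item;
        } elsif (rand() < $n / $i) {
            $sample[int(rand(@sample))] = $item;
        }
    }
    return @sample;
}

my @ids    = (1 .. 1000);    # stand-in ott numbers, purely illustrative
my @first  = reservoir_sample(5, 12345, @ids);
my @second = reservoir_sample(5, 12345, @ids);
print "same sample both times\n" if "@first" eq "@second";
```

The advantage of this approach is that the whole list never needs to be shuffled, and the total number of taxa does not need to be known in advance.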

If you want a specific set of taxa, extract them into a text file and run the script as

./ott2eol.pl ott/taxonomy.tsv specific_taxa.txt
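The text file only needs to contain _ott tokens, so one way to build it is sketched below. The function names ott_tokens and extract_ott_ids, the ott numbers, and the tiny tree string are all invented for this example. The second function is similar in effect to how the script finds IDs, not a copy of it.

```perl
use strict;
use warnings;

# Turn a list of ott numbers into text the script can read:
# one _ottNNN token per line.
sub ott_tokens {
    my @ids = @_;
    return join("", map { "_ott$_\n" } @ids);
}

# Pull the ott numbers out of any text, e.g. tip labels in a Newick string.
sub extract_ott_ids {
    my ($text) = @_;
    my @ids = $text =~ /_ott(\d+)/g;
    return @ids;
}

# Made-up numbers and labels, for illustration only.
my $text = ott_tokens(101, 202, 303);
print $text;
print join(",", extract_ott_ids("(Taxon_a_ott101,(Taxon_b_ott202,Taxon_c_ott303));")), "\n";
```

Saving the output of ott_tokens to specific_taxa.txt should give a file in the form the command above expects.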

The code outputs a header line, then one tab-delimited line per taxon. The first field is the EoL page ID (or “*” if no EoL page can be found, or “-” if no relevant OpenTree numbers exist for that taxon); if there are multiple EoL pages for the same taxon, the number is prefixed with a “+” sign. The second field is the OpenTree ott number. The third is the OpenTree name. All following fields correspond to the EoL page IDs for different OpenTree providers, as specified in the header. These should all be the same as the first field, unless EoL has (say) different pages for the same taxon from GBIF versus from IRMNG.
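If you want to post-process this output, the markers in the first field can be decoded along the following lines. This is only a sketch: decode_best_guess is a hypothetical helper, the status strings are my own wording, and the sample lines are invented rather than real output.

```perl
use strict;
use warnings;

# Interpret one tab-delimited output line (not the header line).
sub decode_best_guess {
    my ($line) = @_;
    chomp $line;
    my ($best, $ott, $name, @per_provider) = split /\t/, $line, -1;
    my %r = (ott => $ott, name => $name, eol => undef,
             providers => \@per_provider);
    if    ($best eq '-')         { $r{status} = 'no provider IDs in OpenTree' }
    elsif ($best eq '*')         { $r{status} = 'no EoL page found' }
    elsif ($best =~ /^\+(\d+)$/) { $r{status} = 'multiple EoL pages'; $r{eol} = $1 }
    elsif ($best =~ /^(\d+)$/)   { $r{status} = 'ok'; $r{eol} = $1 }
    else                         { $r{status} = 'unrecognised' }
    return \%r;
}

# Invented example lines: best guess, ott number, name, then five provider columns.
for my $line ("999\t42\tExamplea\t999\t*\t999\t*\t*\n",
              "+999\t43\tExampleb\t998\t*\t999\t*\t*\n",
              "-\t44\tExamplec\t*\t*\t*\t*\t*\n") {
    my $r = decode_best_guess($line);
    print "$r->{ott}: $r->{status}\n";
}
```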

ott2eol.pl (code in the public domain)

#!/usr/bin/perl -sw
use strict;
use warnings;
use LWP::Simple;
use JSON -support_by_pp;
use Try::Tiny;
use List::MoreUtils qw/ uniq /;
 
my $n = $ARGV[2] || 0; #$n=0 means use all lines
srand($ARGV[3]) if defined $ARGV[3]; #a seed of 0 is also honoured
 
open(TAXA, "<", $ARGV[1])
   or die "Cannot open list of taxa ".$ARGV[1]." for reading: $!";
 
#reservoir sampling algorithm from http://data-analytics-tools.blogspot.co.uk/2009/09/reservoir-sampling-algorithm-in-perl.html
my @taxa = ();
{
    local $/ = "_ott";
    my $i=0;
    while (<TAXA>) {
        if (/(^\d+)/) {
            $i++;
            if ($n==0 || @taxa < $n) {
                push(@taxa, $1);
            } elsif ((rand() < $n/$i)) {
                $taxa[int(rand(@taxa))] = $1;
            }
        }
    }
}
close(TAXA);
 
my %taxa = map { $_ => 1 } @taxa;
 
open(TAXONOMY, "<", $ARGV[0])
	or die "Cannot open taxonomy.tsv file ".$ARGV[0]." for reading: $!";
 
#later elements have priority
my @hierarchyIDs = (['irmng',1347],['if',596],['gbif',800],['worms', 123],['ncbi',1172]); #from http://eol.org/api/docs/provider_hierarchies: Extant & Habitat resource = IRMNG.
my %sources = map {$_->[0]=>[]} @hierarchyIDs;
my $match = join("|", map {$_->[0]} @hierarchyIDs);
print join("\t", ('EoLbestguess', 'ottID', 'name', map {$_->[0]} @hierarchyIDs))."\n";
 
my %params = (key=>"0e8786f5d94e9587e31ed0f7703c9a81f3036c7f", #replace with your own API key.
              cache_ttl=>1000);
while(<TAXONOMY>) {
    if (/^(\d+)/) {
        if (exists $taxa{$1}) {
            my @fields = split /\t\|\t/;
            my @line = ("-", $fields[0], $fields[2]); # - means no relevant hierarchy number on OpenTree
            while($fields[4] =~ /\b($match):(\d+)/g) {
                $line[0] = "*"; # * means provider numbers exist, but no EoL page has been found (yet)
                push(@{$sources{$1}}, $2);    #could have e.g. irmng:1467136,irmng:1288688, so convert into hash of arrays
            }
            my $conflict=0;
            foreach my $h (@hierarchyIDs) {
                $params{'hierarchy_id'}=$h->[1];
                my @EOLids = ();
                while (defined(my $id = pop(@{$sources{$h->[0]}}))) {
                    my $url = "http://eol.org/api/search_by_provider/1.0/".$id.".json?".join("&", map{"$_=$params{$_}"} keys %params);
                    my $sp = fetch_json_page($url);
 
                    if (ref($sp) eq 'ARRAY' && defined $sp->[0]{"eol_page_id"}) { #skip failed downloads or non-list replies
                        $conflict = 1 if (($line[0] ne "*") && $line[0] != $sp->[0]{"eol_page_id"});
                        $line[0] = $sp->[0]{"eol_page_id"};
                        push @EOLids, $line[0];
                    }
                }
                if (@EOLids) {
                    push(@line, join(",", uniq @EOLids));
                } else {
                    push(@line, "*"); # * means no ids for this provider on EOL
                }
            }
            $line[0] = "+".$line[0] if ($conflict);
            print join("\t", @line)."\n";
        }
    }
}
 
 
sub fetch_json_page
{
  my $json = new JSON;
  my ($json_url) = shift;
  # download the json page:
  my $json_text;
  my $content = get( $json_url );
  if (defined($content)) {
    try {
    # these are some nice json options to relax restrictions a bit:
      $json_text=$json->allow_nonref->utf8->relaxed->escape_slash->loose->allow_singlequote->allow_barekey->decode($content);
    } catch {
      warn "In string \"$content\" - JSON error: $_ \n";
    };
  };
  return $json_text;
}
