Spotting maps among images of organisms

How do you get a computer to distinguish pictures like the fallow deer at top from distribution maps (bottom)?

I’ve been writing some code to download freely usable images on the internet for large numbers of organisms – for example, for all species of mammal. The Encyclopedia of Life has done a lot of the hard work already – collecting images from Wikimedia Commons, Flickr, etc., encouraging experts to tag them as trusted, and providing an API for retrieving all the relevant data.

One problem is that a small percentage of the automatically harvested pictures are not pictures of the organism, but maps of its distribution, as seen in the lower picture on the right. Is there a way to automatically identify these as maps, or at least to flag up that they might need checking?

Plot of images based on amount of compression by different image formats

Separating types of images based on amount of compression – 738 mammals. Points are labelled with their EoL dataObject ID (visible in the high-res pdf)

One potential way to do this is to assess the information content of the image in some way. Distribution maps, with their large areas of uniform colour, have a very different information layout from photographs, which have graded colours throughout. Image compression algorithms take this into account: JPEG, for example, is designed for compressing photos, whereas GIF is better for line drawings. So it might be possible to use the effectiveness of different compression algorithms to distinguish between maps and non-maps. The plot here shows that this works pretty well.
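
As a toy illustration of the idea (not part of the analysis behind the plot), the R sketch below builds two synthetic byte vectors, one "map-like" and one "photo-like", and compares how well they compress. It assumes base R's memCompress with gzip as a rough stand-in for a lossless image format; the names flat, noisy and ratio are made up for this example.

```r
# Toy illustration only: two synthetic 100 x 100 single-channel "images"
# stored as raw bytes (these are not real EoL pictures).
set.seed(1)
n <- 100 * 100

# "Map-like": a few large blocks of uniform colour.
flat <- as.raw(rep(c(0, 128, 255, 128), each = n / 4))

# "Photo-like": a smooth gradient plus pixel-level noise, clamped to 0..255.
noisy <- as.raw(pmin(255, pmax(0,
  round(seq(0, 255, length.out = n) + rnorm(n, sd = 20)))))

# Compressed size divided by original size, using lossless gzip from base R.
ratio <- function(bytes) {
  length(memCompress(bytes, type = "gzip")) / length(bytes)
}

flat_ratio  <- ratio(flat)
noisy_ratio <- ratio(noisy)
cat("flat:", flat_ratio, " noisy:", noisy_ratio, "\n")
```

The uniform blocks should shrink to a small fraction of their original size, while the noisy gradient compresses far less; real image formats differ in detail, but the contrast is the one the plot relies on.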

For those who want to try this out, the plot (and methods for exploring similar classifications) can be done pretty elegantly using PerlMagick and R. The main downside is that it can take some effort to get the Image::Magick module working in Perl. But once that’s done, something like the following Perl script should get you the data without the hassle of actually saving the different file formats to disk:

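#!/usr/bin/perl
# Reads image file names (one per line) from STDIN or from files named as
# arguments, and prints each image's pixel count plus its size in bytes
# when encoded in memory as JPEG, PNG and GIF.
use strict;
use warnings;
use Image::Magick;

print "name\traw\tjpg\tpng\tgif\n";
my $im = Image::Magick->new;
while (my $file = <>) {
	chomp $file;
	@$im = ();	# drop the previous image: Read() appends to the sequence
	my $err = $im->Read($file);
	warn "$file: $err\n" if "$err";
	next unless @$im;	# nothing was loaded, so skip this file
	my ($w, $h) = $im->Get('width', 'height');
	# "raw" is the pixel count, used as a proxy for the uncompressed size
	print join("\t", $file, $w * $h,
		map { length(($im->ImageToBlob(magick => $_))[0]) } qw(jpeg png gif)), "\n";
}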

Once you have identified the map images by hand (see note a), you can calculate the amount of compression for each format (jpg÷raw, gif÷raw, etc.) and use R to have a look at how well the images cluster in 3D. I used something like the following R code:

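library(rgl)
# "output" is the tab-separated file produced by the Perl script above
imgBytes <- read.delim("output", stringsAsFactors=FALSE)
# Strip the 4-character extension (e.g. ".jpg") to get the numeric dataObject ID
imgBytes$ID <- as.numeric(substring(imgBytes$name, 1, nchar(imgBytes$name)-4))
# dataObject IDs of the images identified by hand as maps
maps <- c(5853197, 5862428, 5864628, 5864880, 5869829, 5871815,
5884215, 5887367, 5898421, 5901012, 5902027, 5903272, 5906325,
5911164, 5912140, 5915522, 5916418, 5916421, 5922295, 5922967,
5923319, 10070487, 14840166, 17273184, 19678272)
# Log compression ratio for each format on its own axis; maps in red
plot3d(log(imgBytes$jpg/imgBytes$raw),
log(imgBytes$png/imgBytes$raw),
log(imgBytes$gif/imgBytes$raw),
col=ifelse(imgBytes$ID %in% maps, "red", "black"), size=3)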

It takes a little (but not much) further work to identify the other types of image by hand in order to produce the plot above.
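
To go from looking at the plot to actually flagging images that might need checking, one option is a simple cutoff on the compression ratios. The sketch below is an assumption rather than the method used for the plot: the function flag_possible_maps and its cutoff argument are hypothetical, and the byte counts in toy are invented purely to show the call. The rule flags any image whose lossless PNG is smaller than cutoff times its JPEG, on the reasoning that flat-coloured maps suit PNG far better than photos do.

```r
# Hypothetical helper: return the names of images whose PNG size is less
# than `cutoff` times their JPEG size. The default of 1 is a guess and
# should be tuned against the hand-labelled maps before being trusted.
flag_possible_maps <- function(imgBytes, cutoff = 1) {
  imgBytes$name[imgBytes$png / imgBytes$jpg < cutoff]
}

# Invented numbers in the same column layout as the Perl script's output;
# real values would come from read.delim("output").
toy <- data.frame(
  name = c("111.jpg", "222.jpg"),
  raw  = c(120000, 120000),
  jpg  = c(15000, 9000),
  png  = c(90000, 4000),
  gif  = c(60000, 5000),
  stringsAsFactors = FALSE
)

flagged <- flag_possible_maps(toy)
print(flagged)
```

With real data, comparing the flagged names against the hand-identified maps (for example with table()) would show how many maps such a rule misses and how many photos it flags by mistake.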

Notes

a. I named the images by their EoL “dataObject ID”, flagged the maps in my file browser, and pasted their image numbers into an R variable I called “maps”.
