How effectively can a computer distinguish pictures or drawings of organisms from maps of their distribution? Based on my previous thoughts on recognising maps, a simple statistical technique allows a computer to correctly identify 99% of maps from within a training set of 1210 images (including 272 maps). Pleasingly, this classification has only a 0.5% false positive rate.
Pretty good, but in the case of images submitted to the Encyclopedia of Life, we can do better. If we make a guess as to the original format of the image, and include this in the model, we can correctly separate all 272 maps from the 938 pictures and drawings in my particular dataset. If you want to try it out, the dataset is here, and the R code to perform the classification is near the end of this post.
Of course, my algorithm is specifically tailored to the set of images I have selected. I tried to include a variety of plants, animals, line drawings, sketches, etc., but my success might simply be a result of choosing an easily distinguished set of images. I must sound another note of caution too. With enough variables, and using complex fitting routines, it’s always possible to construct a statistical model that will correctly predict a known outcome. I have tried to keep the model relatively simple, in the hope this will make it general. The real test, however, will be to try the classification on unknown images.
Nitty gritty details
I have used one of the best-known classification techniques in the statistician’s toolbox: logistic regression. This is suitable where the outcome we wish to predict is a binary response, e.g. “image is a map” versus “image is not a map”. In this case, it allows us to calculate the probability of an image being a map, based on a number of predictor variables.
To work out which images are likely to be maps, we then just pick all those with a probability above a certain cutoff value, such as 0.5. If we wish to be more conservative, we can pick higher values, which will lead to fewer false positives, but more false negatives.
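The trade-off between false positives and false negatives can be seen in a tiny worked example. This is only an illustrative sketch: the probabilities and labels below are made up, and `count_errors` is a hypothetical helper, not part of the analysis in this post.

```r
# Hypothetical fitted probabilities of "is a map" for six images (made-up numbers)
p_map  <- c(0.95, 0.80, 0.55, 0.45, 0.30, 0.05)
is_map <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Count false positives and false negatives at a given cutoff
count_errors <- function(cutoff) {
  called_map <- p_map > cutoff
  c(cutoff    = cutoff,
    false_pos = sum(called_map & !is_map),   # called a map, but is not one
    false_neg = sum(!called_map & is_map))   # a real map that was missed
}

# Compare a cutoff of 0.5 with a more conservative cutoff of 0.9
errors <- t(sapply(c(0.5, 0.9), count_errors))
print(errors)
```

With these made-up numbers, raising the cutoff from 0.5 to 0.9 removes the single false positive but increases the false negatives from one to two.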
Partly as a result of trial and error (I could be accused of data dredging here…), I chose 5 predictor variables. Four are the (log of the) amount of compression achieved when saving the images as
- JPG (based on the raw jpg files provided by EoL, which are at 80% compression)
- PNG, converted using ImageMagick via
mogrify -format png *.jpg
- TIFF with LZW compression, converted using ImageMagick via
mogrify -format tiff -compress LZW *.jpg
- lossless JPEG2000, converted using ImageMagick via
mogrify -format jp2 -compress lossless *.jpg
The fifth is the log of the aspect ratio of the image.
From the plots of the data in my previous post, it didn’t seem as if maps simply lie on one end of the compression spectrum, and images on the other end. Instead, maps often seem to display intermediate amounts of compression. That means we should permit these variables to contribute in a quadratic manner, allowing intermediate compression amounts to be associated with the maximum probability of a map.
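To see why a quadratic term helps, here is a minimal sketch on simulated data (not the EoL dataset). The variable `x` stands in for a log compression amount, and the "true" relationship is an assumption chosen so that maps are most likely at intermediate values.

```r
set.seed(1)
n <- 500
x <- runif(n, -2, 2)                    # stand-in for a log compression amount
p_true <- plogis(2 - 3 * x^2)           # assumed truth: maps peak at intermediate x
is_map <- rbinom(n, 1, p_true)

# A purely linear predictor cannot represent a peak in the middle
linear_fit    <- glm(is_map ~ x,          family = binomial)
quadratic_fit <- glm(is_map ~ poly(x, 2), family = binomial)

cat("AIC linear:", AIC(linear_fit), " AIC quadratic:", AIC(quadratic_fit), "\n")

# Fitted probability of a map along a grid of x values
grid <- data.frame(x = seq(-2, 2, by = 0.5))
pred <- predict(quadratic_fit, newdata = grid, type = "response")
print(round(pred, 2))
```

On data simulated this way, the quadratic fit should have a much lower AIC, and its fitted probabilities should rise towards the middle of the range and fall away at both ends.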
Finally, there is no reason to assume that each of the variables should contribute independently to the probability. There are likely to be interactions between them. However, including all possible interactions between 5 (quadratic) variables gives rise to a huge number of fitting parameters. Instead, I have restricted the fit to only 2-way interactions. Unfortunately that still leaves a model with 50 fitted parameters – a source of valid criticism, I feel.
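The parameter count can be checked without fitting anything, by building the design matrix for the same formula structure. The sketch below uses made-up random predictors (the column names are only placeholders) and counts the columns that R would estimate.

```r
set.seed(42)
# Five made-up numeric predictors standing in for the real ones
fake <- as.data.frame(matrix(rnorm(200 * 5), ncol = 5,
          dimnames = list(NULL, c("jpg", "png", "tif", "jp2", "logAspect"))))

# Same structure as the model: quadratic terms plus all 2-way interactions
X <- model.matrix(~ (poly(jpg, 2) + poly(png, 2) + poly(tif, 2) +
                     poly(jp2, 2) + poly(logAspect, 2))^2, data = fake)

n_params <- ncol(X) - 1   # exclude the intercept column
print(n_params)           # 10 main-effect columns + 40 interaction columns
```

Each quadratic variable contributes 2 columns (10 in total), and each of the 10 pairs of variables contributes 2 × 2 = 4 interaction columns (40 in total), giving the 50 parameters mentioned above, plus an intercept.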
The strength of the predictive model can be judged by the Akaike Information Criterion (lower is better), or more crudely by tabulating the number of correctly predicted maps & images and the numbers of false positives and negatives. The following R code does all this: you should be able to copy and paste it directly into R to carry out the classification yourself.
# read the data; stringsAsFactors=TRUE so that imageType is read as a factor (no longer the default since R 4.0.0)
mapdata <- read.table(url("http://yanwong.me/?txt=eolmaptestdata"), header=TRUE, stringsAsFactors=TRUE)
# convert raw sizes into log(compression amount)
TransformedData <- transform(mapdata,
  jpg = log(jpg/(width*height*3)),
  png = log(png/(width*height*3)),
  tif = log(tif/(width*height*3)),
  jp2 = log(jp2/(width*height*3)),
  aspectRatio = width/height)
# fit the logistic regression model: quadratic terms plus all 2-way interactions
model1 <- glm(imageType ~ (poly(jpg,2)+poly(png,2)+poly(tif,2)+poly(jp2,2)+poly(log(aspectRatio),2))^2, TransformedData, family=binomial(logit), control=list(maxit=50))
# glm() models the probability of the second factor level (Not_map, assuming the levels sort as Map, Not_map), so fitted values below 0.5 are classed as maps
table(actual=TransformedData$imageType, predicted.model1=ifelse(predict(model1, type="response")<0.5, "Map", "Not map"))
cat('AIC=', AIC(model1), '\n') # output the AIC, to see how we are doing. Could try simplifying the model using drop1(model1)
# Now try adding the inferred file type, as a simple independent factor
# (a perfect fit may produce warnings about fitted probabilities of 0 or 1)
model2 <- glm(imageType ~ (poly(jpg,2)+poly(png,2)+poly(tif,2)+poly(jp2,2)+poly(log(aspectRatio),2))^2 + inferredFiletype, TransformedData, family=binomial(logit), control=list(maxit=50))
table(actual=TransformedData$imageType, predicted.model2a=ifelse(predict(model2, type="response")<0.5, "Map", "Not map"))
# probably an artifact of perfect fitting, but AIC improves by treating tif compression as a linear, rather than quadratic, predictor
model2b <- glm(imageType ~ (poly(jpg,2)+poly(png,2)+tif+poly(jp2,2)+poly(log(aspectRatio),2))^2 + inferredFiletype, TransformedData, family=binomial(logit), control=list(maxit=50))
table(actual=TransformedData$imageType, predicted.model2b=ifelse(predict(model2b, type="response")<0.5, "Map", "Not map"))
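The tables above are all computed on the same images used to fit the models. As a nod to the earlier caution about testing on unknown images, here is a minimal sketch of a held-out check. It runs on simulated stand-in data with a single predictor `x`, not on the EoL dataset, and the 30% split is an arbitrary choice.

```r
set.seed(7)
# Simulated stand-in data: maps are most likely at intermediate x
n <- 600
d <- data.frame(x = runif(n, -2, 2))
d$imageType <- factor(ifelse(rbinom(n, 1, plogis(2 - 3 * d$x^2)) == 1,
                             "Map", "Not_map"))

# Hold out a random 30% of the rows; fit on the remainder only
test_rows <- sample(n, size = round(0.3 * n))
train <- d[-test_rows, ]
test  <- d[test_rows, ]

fit <- glm(imageType ~ poly(x, 2), data = train, family = binomial)

# glm() models the second factor level (Not_map), so low values mean "Map"
p_not_map <- predict(fit, newdata = test, type = "response")
predicted <- ifelse(p_not_map < 0.5, "Map", "Not_map")

# Confusion table and accuracy on images the model has never seen
holdout_table <- table(actual = test$imageType, predicted = predicted)
print(holdout_table)
accuracy <- mean(predicted == as.character(test$imageType))
cat("held-out accuracy:", round(accuracy, 3), "\n")
```

The same pattern could be applied to the real data by splitting the rows of TransformedData, although with roughly 50 parameters and only 272 maps the held-out estimates may be noisy.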
Only for Encyclopedia of Life geeks
For any particular point in my dataset, you can look at the image on the Encyclopedia of Life by using its dataObjectID, for example, by using a URL like http://eol.org/data_objects/dataObjectID. You could even use R to browse (for example) all the false positives from model1 using something like this:
falsePositives <- subset(TransformedData, imageType == "Not_map" & predict(model1, type="response")<0.5)$dataObjectID
sapply(paste("http://eol.org/data_objects", falsePositives, sep="/"), browseURL)
Inferring original filetypes on EoL
The filetype of the original image was inferred using the EoL dataObjects API, and looking at the following fields (in order of priority):
- mimeType (but ignored if this is “image/jpeg”)
- an image suffix (e.g. “.jpg”, “.png”, “.svg”) in the “source” field
- an image suffix (e.g. “.jpg”, “.png”, “.svg”) in the “mediaURL” field (this is complex, as the URL could have CGI parameters appended to it too; in the unlikely event that you might want to implement this, contact me!)
- an image suffix (e.g. “.jpg”, “.png”, “.svg”) in the “title” field
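The priority order above could be sketched roughly as follows. This is not the code used to build the dataset: `image_suffix` and `infer_filetype` are hypothetical helpers, the list of recognised suffixes is an assumption, and the function takes the four field values as plain strings rather than calling the EoL API.

```r
# Hypothetical helper: extract an image suffix from a string, ignoring any
# query string or fragment (e.g. "?width=200") that follows it
image_suffix <- function(x) {
  if (is.null(x) || is.na(x)) return(NA_character_)
  x <- sub("[?#].*$", "", x)   # drop CGI parameters or a fragment
  m <- regmatches(x, regexpr("\\.(jpe?g|png|gif|svg|tiff?|bmp)$", x,
                             ignore.case = TRUE))
  if (length(m) == 0) NA_character_ else tolower(sub("^\\.", "", m))
}

# Hypothetical helper: apply the fields in order of priority.
# Note the result is not normalised: a mimeType is returned as-is
# (e.g. "image/png"), whereas a suffix is returned bare (e.g. "png").
infer_filetype <- function(mimeType = NA, source = NA, mediaURL = NA, title = NA) {
  if (!is.na(mimeType) && mimeType != "image/jpeg") return(mimeType)
  for (field in list(source, mediaURL, title)) {
    s <- image_suffix(field)
    if (!is.na(s)) return(s)
  }
  NA_character_
}

# Made-up example values (example.org URLs are placeholders)
infer_filetype(mimeType = "image/jpeg",
               mediaURL = "http://example.org/img/range_map.PNG?width=200")
```

Stripping everything after the first “?” or “#” is a simplification; real URLs may carry the filename inside the parameters instead, which this sketch would miss.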