Visualizing data on large phylogenies at the pixel-level

PlacentalPicDistribution

A phylogenetically organised display of data for all placental mammal species. Red pixels are those without a picture on EoL.

Modern technology, coupled with molecular taxonomy, means we now have very large evolutionary trees: ones with tens or hundreds of thousands of species. In fact, the Open Tree of Life project aims to create a tree of all living things, which would have millions of species. The obvious question is how to display these enormous amounts of data.

Here’s one possibility I’ve come up with: use a single pixel for each tip of the tree (each species). Then, if we could use the whole of a one megapixel canvas, we could display information about a million or so species. Another nice thing about this is that the information is balanced: each data point corresponds to a single species, without species on some branches being given greater prominence than those on others.

Most evolutionary trees aren’t laid out in a way that makes it possible to use the whole canvas, because their tips are conventionally in a single line (or sometimes bent round to give a circle). But OneZoom trees differ, in that the tips can occur in various positions in 2D space. It’s true that tips on these trees don’t fill the space uniformly: some areas of the canvas are densely tiled with fractal branches, whereas other parts are sparsely populated. Nevertheless, the tips on OneZoom trees occupy a more extensive area of space than on conventional trees.

TetrapodIUCN1

IUCN red-list data for all the tetrapods. Green points are least concern. The redder the point, the more endangered the species (black pixels indicate already extinct species). Grey pixels are where data are missing or deficient. It’s clear that there is considerable heterogeneity in both coverage (white areas) and extinction risk (red areas) — it’s not looking good for turtles or amphibians.

So as an initial pass, the picture above shows a data point for every one of the ~22000 tetrapod species, overlaid on a (very faint) OneZoom tree. I’ve plotted IUCN status simply because it was easily available data. More interesting plots could be made for e.g. body size or other species traits.
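To make the colour scheme concrete, here is a minimal Python sketch of how IUCN categories might be turned into pixel colours. The function name `iucn_colour`, the exact RGB values, and the decision to draw "extinct in the wild" (EW) in black alongside EX are all assumptions for illustration, not necessarily what was used for the figure above.

```python
# Hypothetical colour scheme: the RGB values are assumptions, chosen only to
# match the description (green -> red for rising risk, black, grey).
THREAT_ORDER = ["LC", "NT", "VU", "EN", "CR"]  # least to most threatened


def iucn_colour(category):
    """Return an (r, g, b) tuple for an IUCN Red List category code."""
    if category in ("EX", "EW"):
        # Extinct species are black; treating EW the same way is an assumption.
        return (0, 0, 0)
    if category not in THREAT_ORDER:
        # Data deficient (DD), not evaluated (NE), None, or anything unrecognised.
        return (128, 128, 128)
    # Position along the threat scale, from 0.0 (LC) to 1.0 (CR).
    t = THREAT_ORDER.index(category) / (len(THREAT_ORDER) - 1)
    return (round(255 * t), round(255 * (1 - t)), 0)
```

Species with no IUCN entry at all fall through to grey here, which keeps them distinct from the white areas of the canvas where there are no tips.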

Each species on this tree has been allocated a pixel, placed as close as possible to the species’ original position on the tree. Well, that’s not quite true, because finding the optimal placement[a] is hard: the number of possible mappings of points to points on a pixel grid is astronomically huge. It would be practically impossible to search through all these combinations, and even some of the clever mathematical shortcuts that might allow us to solve this “assignment problem” are likely to baulk at the millions of possible pixel placements involved. Instead, I’ve implemented a version of Keim’s “GridFit” algorithm to place the pixels reasonably close to their original positions.[b]
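To give a feel for the problem, here is a much simpler greedy placement in Python. This is not GridFit, and not the code used for the figures: it just snaps each point to a nearby free pixel, searching outwards in square rings. The function name `place_points` and the convention that pixel (col, row) has its centre at (col + 0.5, row + 0.5) are assumptions made for this sketch.

```python
def place_points(points, width, height):
    """Greedily give each (x, y) point its own pixel on a width x height grid.

    Returns a list of (col, row) pixels in the same order as `points`.
    The result depends on input order and is generally not optimal.
    """
    if len(points) > width * height:
        raise ValueError("more points than pixels")
    taken = set()
    placed = []
    for x, y in points:
        # Ideal pixel: the one containing the point, clamped to the canvas.
        cx = min(max(int(x), 0), width - 1)
        cy = min(max(int(y), 0), height - 1)
        r = 0
        while True:
            # Free, in-bounds pixels on the square ring at distance r.
            ring = [
                (px, py)
                for px in range(cx - r, cx + r + 1)
                for py in range(cy - r, cy + r + 1)
                if max(abs(px - cx), abs(py - cy)) == r
                and 0 <= px < width
                and 0 <= py < height
                and (px, py) not in taken
            ]
            if ring:
                # Take the ring pixel whose centre is closest to the point.
                best = min(
                    ring,
                    key=lambda p: (p[0] + 0.5 - x) ** 2 + (p[1] + 0.5 - y) ** 2,
                )
                break
            r += 1
        taken.add(best)
        placed.append(best)
    return placed
```

For example, `place_points([(0.2, 0.3), (0.4, 0.1)], 3, 3)` gives the first point pixel (0, 0) and pushes the second to a neighbouring pixel. Points processed early get the best spots, so in dense regions later points can end up some way from where they belong.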

I’ve used the word “pixel”, but this can be generalised to a square of any size, so it is possible, for example, to display the data as a single pixel in the middle of a 3×3 square of otherwise blank pixels, which allows the tree to be seen underneath. Another way of making the underlying tree visible is to use some form of transparency, of course.
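As a small sketch of that generalisation, the Python below treats each grid position as a `cell` × `cell` square and colours only its centre pixel. The helper names `make_canvas` and `draw_cell`, and the canvas as a plain list of rows, are assumptions for illustration.

```python
def make_canvas(width, height, background=None):
    """Create a height x width canvas as a list of rows of pixels."""
    return [[background] * width for _ in range(height)]


def draw_cell(canvas, col, row, colour, cell=3):
    """Colour only the centre pixel of the cell x cell square at grid (col, row).

    The remaining pixels of the square are left untouched, so whatever is
    drawn underneath (e.g. a faint tree) stays visible.
    """
    x = col * cell + cell // 2
    y = row * cell + cell // 2
    canvas[y][x] = colour
```

With `cell=1` this reduces to the one-pixel-per-species case, while `cell=3` needs a canvas three times as wide and three times as high for the same grid.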


PicTest
It’s not just data visualisation either. In my last blog post I talked about using images from the Encyclopedia of Life in various ways. If, instead of pixels, we use small thumbnail images, this allows us to place large numbers of images of organisms onto a tree. So on the left, for example, is a screenshot of a pdf showing pictures of all placental mammal species which have a freely reusable[c] picture available on the Encyclopedia of Life. I’ve restricted it to showing only images with a quality rating greater than the default 2.5 stars. I’d encourage you to download the 3 MB pdf to have a look; many of the pictures are quite stunning. You can find out more about the creatures by clicking on them, or look at the original picture itself by clicking the text on the bottom left of each miniature photo.[d]

PlacentalPDCCBYPicDistribution

Phylogenetic distribution of public domain and CC-BY photos of placental mammal species on the Encyclopedia of Life. Each coloured point is a species: red indicates no photo, green a trusted photo, yellow an unverified one. Depth of green and yellow indicate quality of photo.

That brings me to the picture with which I started this post. I’m often trying to locate good quality pictures of organisms. How many are available out there, and how does that change over the tree of life? The picture at the top of this post uses my pixel display idea to give the answer for the placental mammals. Red pixels are species for which no image is available on the Encyclopedia of Life. Green pixels are species with a “trusted” picture, i.e. one whose identification has been verified. Yellow pixels indicate “unverified” pictures (most of these are probably correctly identified, however). The depth of colour of the green and yellow pixels represents the image quality (EoL rating) of the best available photo. Above is a similar plot but just for freely usable photos (those which are either in the public domain or under the least restrictive of the Creative Commons attribution licenses). Looks like shrews and rodents need a bit of photographic love.

As you can see from these plots, OneZoom fractal trees are not ideal for this sort of display. There are too many species concentrated into a localised space at the tips of the tree. It would be good to find some tree-drawing algorithms that spread out the tips of the tree more evenly in space. Any suggestions?

Notes

a. e.g. that which minimises the squared distance between original point and corresponding pixel centres
b. This could probably be improved by a second pass which moves points to any near yet still empty pixel spots.
c. Either in the public domain, or licensed under Creative Commons “attribution only”.
d. Note that there are a few anomalies, such as the swapping of the hyraxes and sea cows. I’ll correct this in a week or two.
