Human impressions of animal sounds

If a friend tries to do an impression of an animal call, how easy is it to work out the animal? Alternatively, how good are different people at making animal noises? If a computer could assess the accuracy of these impressions, it would open up some fun possibilities. Imagine searching a database of animal sounds, or a set of known animals in a nature reserve, simply by doing an appropriate impression. I haven’t been able to find many people researching this topic, but I reckon I’m halfway to solving it. I just need a little extra help with some image analysis.

Spectrogram of animal imitations

My impressions of animals: not too far from the real thing, although I definitely can’t squeak as high as a guinea pig.

Taking a few example sounds of mammals from EoL, I sat down in Heathrow Terminal 1 and made gibbon, hyena, and guinea-pig impressions into my computer’s microphone. The spectrogram of the real noises is above, and that of my impressions is on the left. To my untrained eye, they look quite similar. Certainly, if I were asked to match each impression against the possible real noises, I reckon I would get most of them right, save perhaps the guinea pig (or cavy).
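For anyone who wants to reproduce the pictures without seewave: a spectrogram is just the magnitudes of short-time Fourier transforms taken over successive windows of the signal. A minimal sketch in Python/NumPy (my own function name and parameters, not the R pipeline used below):

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via a plain short-time Fourier transform:
    Hann-windowed frames; rows = frequency bins, columns = time frames."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

# A pure 1 kHz tone sampled at 8 kHz should peak at bin 1000/8000 * 256 = 32
sr = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
spec = spectrogram(tone)
print(spec.shape, int(np.argmax(spec[:, 0])))
```

Real spectrogram tools add dB scaling, overlap choices, and colour maps on top of this, which is exactly why two plots of the same sound can look so different.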

I’ve come across one research project that tried to match animal sounds on the basis of the spectrogram image itself. They assess how much information one spectrogram contains about another by compressing the two images as successive frames of an MPEG movie. Movie files compress well because successive frames often contain the same object shifted to a different place – say, a car moving across the screen. MPEG compression tries to detect this, so it can store most of the information in subsequent frames in terms of how the original picture has been shifted. While it’s a fun idea to use compression to assess shared information between images, there’s an inherent problem in applying this to spectrograms: they can look radically different simply depending on the parameters used to plot them – the frequency range, the decibel range, cutoffs, etc.
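To give a flavour of compression-based similarity in general, here is a minimal sketch of the closely related normalized compression distance, using Python’s zlib on raw byte strings rather than the MPEG-on-images pipeline the project used (the function names and toy inputs are mine):

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size of data under zlib at maximum compression."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for near-identical inputs,
    near 1 for unrelated ones. Concatenating similar inputs barely grows
    the compressed size, because the compressor reuses shared structure."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

squeaks = b"squeak squeak squeak " * 50
howls = b"howl awooo howl awooo " * 50
print(ncd(squeaks, squeaks) < ncd(squeaks, howls))  # True: closer to itself
```

The MPEG trick is the same idea, with the video encoder’s motion-compensation doing the “shared structure” detection between the two frames.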

Nevertheless, the project authors claim a pretty good success rate, so as a first pass I compressed greyscale versions of the above spectrograms as two-frame MPEG files, using ffmpeg. Unfortunately, it doesn’t work. Here, for example, are the compression measurements (the “CK” score) of each of the real animal spectrograms compared against my impression of a wolf. For reproducibility, the R code is:

library(tuneR)
library(seewave)

MP3audiofile_dir = "~/Audio/"

names <- c("Wolf", "Hyena", "Panda", "Cavy", "Gibbon")

real <- lapply(paste(MP3audiofile_dir, names, ".mp3", sep=""), function(x) readMP3(x))

impressions <- lapply(paste(MP3audiofile_dir, names, "_i.mp3", sep=""), function(x) readMP3(x))

sapply(1:5, function(i) {
png(paste(names[i], ".png", sep=""), 500,500)
par(mar=c(0,0,0,0))
spectro(real[[i]], pal = grey.colors, colbg="black", scale=F, grid=F, flim=c(0,10))
dev.off()
png(paste(names[i], "_i.png", sep=""), 500,500)
par(mar=c(0,0,0,0))
spectro(impressions[[i]], pal = grey.colors, colbg="black", scale=F, grid=F, flim=c(0,10))
dev.off()
})

mpegSize <- function(file1, file2) {
 # Encode the two images as successive frames of an MPEG video and return
 # the file size: the more similar the frames, the better they compress
 listfile <- tempfile(pattern = "soundmatch", tmpdir = tempdir(), fileext = ".txt")
 videofile <- tempfile(pattern = "soundmatch", tmpdir = tempdir(), fileext = ".mpeg")
 fileConn <- file(listfile)
 writeLines(paste("file '", c(normalizePath(file1), normalizePath(file2)), "'", sep=""), fileConn)
 close(fileConn)
 # -safe 0 lets the concat demuxer accept the absolute paths from normalizePath
 system(paste("ffmpeg -y -f concat -safe 0 -i", listfile, "-pix_fmt yuv420p", videofile),
        ignore.stdout = TRUE, ignore.stderr = TRUE)
 sz <- file.info(videofile)$size
 unlink(listfile)
 unlink(videofile)
 return(sz)
}

CK <- function(x, y) {
 # Compression-based dissimilarity: near 0 when each image compresses as
 # well against the other as against itself, larger when they share less
 ((mpegSize(x, y) + mpegSize(y, x)) / (mpegSize(x, x) + mpegSize(y, y))) - 1
}

compare <- function(focal, names) {
 sapply(names, function(x) {
  # the spectrogram PNGs were saved to the working directory above
  cmpnames <- paste(c(focal, x), ".png", sep="")
  CK(cmpnames[1], cmpnames[2])
 })
}

compare("Wolf_i",names)

The resulting CK scores:

     Wolf     Hyena     Panda      Cavy    Gibbon 
0.4000000 0.3571429 0.4400000 0.2727273 0.3571429


Not the same.

As you can see, by this measure my wolf impression matched the cavy sound best. Damn. I think there are a few reasons why this crude first pass failed. Firstly, I took spectrograms of the whole sound, rather than of short clips of the same length. Looking at the spectrograms by eye, though, I don’t think this should have been a problem: you get pretty much the same mismatches when using shorter, similarly sized clips.

More problematic are the details of the compression itself. How best to spot similarities between frames is left to the individual encoder; it isn’t specified in the MPEG standards. And spotting a moving object is a rather different task from comparing spectrograms. For instance, we wouldn’t want to search spectrograms for slightly rotated versions of a pattern. Only certain operations on a spectrogram image can reasonably be said to preserve the nature of the sound: left-right shifts (timing) and up-down shifts (pitch), for example. Moreover, the up-down shifts are not simply linear: a sound shifted up or down by an octave has its pattern stretched or shrunk on a linear frequency axis. It might therefore be better to use a log-frequency spectrogram, on which a pitch shift becomes a plain translation, but annoyingly, the seewave package in R doesn’t have this functionality.
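The remapping itself is easy enough to do by hand. A sketch in Python/NumPy (my own function name and default shapes, not a seewave feature): resample the rows of a linear-frequency spectrogram onto a log-spaced frequency axis, after which shifting a sound by an octave moves its pattern up or down by a fixed number of rows rather than stretching it.

```python
import numpy as np

def log_freq_spectrogram(spec, fmin, fmax, n_out=128):
    """Resample a linear-frequency spectrogram (rows = frequency bins
    spanning fmin..fmax with fmin > 0, columns = time frames) onto a
    log-spaced frequency axis, interpolating each time frame in turn."""
    lin_f = np.linspace(fmin, fmax, spec.shape[0])  # original bin centres
    log_f = np.geomspace(fmin, fmax, n_out)         # log-spaced target bins
    return np.stack([np.interp(log_f, lin_f, spec[:, t])
                     for t in range(spec.shape[1])], axis=1)

warped = log_freq_spectrogram(np.random.rand(256, 100), fmin=50, fmax=10_000)
print(warped.shape)  # (128, 100)
```

The cost is that the lowest octaves get smeared across many rows while the top octaves are squeezed, which is why fmin has to be strictly positive.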

So at the moment I’m stuck. At a glance, I can make a guess at the similarity between two spectrogram traces, and it seems like it should be possible for a computer to do the same. But I don’t know how best to code this up. Perhaps there’s something in OpenCV that might work. Suggestions please!
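One simple thing to try, sketched here in plain NumPy (OpenCV’s matchTemplate does something similar, and the function name and shift range below are my own guesses at sensible values): score two equal-sized spectrogram images by the best normalized cross-correlation over a range of small vertical and horizontal shifts, so that timing and modest pitch offsets don’t count against a match.

```python
import numpy as np

def best_shift_score(a, b, max_shift=10):
    """Slide image b over image a by up to max_shift pixels in each
    direction and return the best normalized correlation (-1..1).
    np.roll wraps around the edges, which is fine for a rough sketch."""
    def ncc(x, y):
        x = x - x.mean()
        y = y - y.mean()
        denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
        return 0.0 if denom == 0 else float((x * y).sum() / denom)
    best = -1.0
    for dr in range(-max_shift, max_shift + 1):       # vertical = pitch shift
        for dc in range(-max_shift, max_shift + 1):   # horizontal = timing shift
            best = max(best, ncc(a, np.roll(np.roll(b, dr, axis=0), dc, axis=1)))
    return best

rng = np.random.default_rng(0)
img = rng.random((40, 40))
shifted = np.roll(img, 3, axis=0)      # same pattern, moved up 3 rows
other = rng.random((40, 40))           # unrelated pattern
print(best_shift_score(img, shifted), best_shift_score(img, other))
```

This ignores the log-scale stretching problem above, but it at least encodes which image transformations we think are sound-preserving, which the MPEG encoder never could.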

