Reproducible plots of public data (Guardian Google spreadsheets, UNdata) using R

There’s a fair bit of publicly available data that can be used to answer questions about global trends. Previously, I’ve used data from the UN and the UK (often the Office for National Statistics), but I’ve only just discovered the commendable data service provided by the British paper “The Guardian”. They encourage public access to (and visualization of) the data on which their articles are based. This is done by tidying up the data and putting it in publicly accessible Google spreadsheets.

Plot reproducible from free online data using R

With all this data available, it’s possible to carry out reproducible research and analysis, such as the plot opposite. For this you require an easy (and free) way to carry out analysis using the datasets. Perhaps the most powerful way is to use something like R, which can be issued commands to carry out analysis, generate visualizations, etc. For maximum reproducibility, I’ve been trying to write R code that accesses these data sources directly, without having to download and tweak intermediate data files. This should make it easy to analyse datasets and produce attractive plots – for example, of estimated life expectancy and population size. If you’re not interested in the details, you might want to skip straight to the code to paste into your R session.

Data stored in online text (e.g. CSV) files

The read.table function, and its variants such as read.csv, are the main way of importing data tables into R. Rather pleasingly, they can accept URLs as input sources, as well as files on your own computer. Many datasets are made available as csv files, so you might be able to do read.csv("http://data.org/interestingdatafile.csv"). The difficulty then becomes finding the right file: you can explore some from the UN at http://data.un.org/Explorer.aspx#topics. (I’m using this UN data as an example of text-file download, but there is also an API for retrieving these data without going via text files; however, this currently requires registration.)
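As a quick, self-contained illustration of this (the file and its contents are invented here), read.csv treats a URL just like a local file name. A file:// URL is used below so the sketch runs offline, but an http:// link to a real csv file behaves the same way:

```r
# Write a tiny csv to a temporary file, then read it back via a file:// URL,
# exactly as you would with an http:// URL pointing at an online dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country,Population", "Iceland,318000", "Malta,417000"), tmp)
dat <- read.csv(paste("file://", tmp, sep=""))
str(dat)  # a 2-row data.frame with Country and Population columns
```

This works because read.csv hands the string to R’s connection machinery, which recognizes URL schemes as well as plain paths.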

SVG output from the code below

As an example, take the demographic statistics on gender and population collected by the UNSD, used in the plot opposite. The data can be inspected here, and then filtered to focus on male and female countrywide totals for (say) 2010, as well as outputting standardized country codes (useful when merging data from several sources). The UN page even gives an option to download a csv formatted version of the table.

Unfortunately, things are rarely as simple as you’d like. The UN provides this csv data packaged inside a zip archive, so, annoyingly, it’s simplest to temporarily download the file before accessing it in R, something like this. (You should be able to just copy this code and paste it straight into R: the ‘>’ symbol which may appear at the front of each line is generated using css magic, so shouldn’t get copied when you select the code.)

read.csvzipURL <- function(url, ...) { #or use getZip from the Hmisc package 
 temp <- tempfile()
 download.file(url, temp, mode="wb")
 zipFileInfo <- unzip(temp, list=TRUE)
 if(nrow(zipFileInfo) > 1) stop("More than one file inside zip")
 data <- read.csv(unz(temp, as.character(zipFileInfo$Name)), ...)
 unlink(temp) #delete tempfile
 return(data)
}

URL <- 'http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=tableCode:1,tableCode:1;refYear:2010;areaCode:0;sexCode:1,2&DataMartId=POP&Format=csv&c=1,2,3,5,7,9,11,12,13&s=_countryEnglishNameOrderBy:asc,refYear:desc,areaCode:asc'
PopBySex2010 <- read.csvzipURL(URL)

Now that the data is stored in the PopBySex2010 dataframe, we can jiggle it about and create the revealing plot above:

#First restructure the data for ease of use
require(reshape) #you may need to install this package by doing install.packages('reshape')
MaleVsFemalePop <- melt(subset(PopBySex2010, !is.na(Value)), id=c("Country.or.Area", "Sex", "Value"))
#some countries have multiple estimates for 2010 - just take the mean of all these
Pop <- cast(MaleVsFemalePop, Country.or.Area ~ Sex, fun.aggregate=mean, value='Value')
bottom10 <- tail(Pop[order(Pop$Male/(Pop$Male+Pop$Female)),],10) #sort ascending by male proportion; tail() takes the 10 highest
#require(cairoDevice); Cairo_svg("PopGender10.svg") #uncomment to plot to an svg file. Don't forget to dev.off() at the end
par(mar=c(4.5,10,4,1),mgp=c(2,1,0), las=1, col.sub='grey', cex.sub=0.8) #make space for country names etc.
lab <- with(bottom10, sprintf("%s (%0.1f%%)", Country.or.Area, Male/(Male+Female)*100))
barplot(t(as.matrix(bottom10[,c('Female','Male')]))/1000000, main='The 10 countries with the highest\npercentage of males in 2010', beside = TRUE, horiz=TRUE, names.arg=lab, col=c('pink', 'lightblue'), xlab='Number of people (millions)', sub='(data from UNSD Demographic Yearbook)')
legend("bottomright", inset=.05, c("Males","Females"), fill=c('lightblue', 'pink'))

Getting data from online Excel spreadsheets

SVG output from the code below

Data files, such as those made available by the UN World Population Prospects, are sometimes downloadable as Excel spreadsheets. Fortunately, the “gdata” package in R provides a function “read.xls” which will not only read the Excel files provided by the UN, but can also take a URL as the first argument. The only issue with this package is that it requires perl to be installed – which it is by default on OS X and most Linux distributions. If you’re doing this sort of data processing in Windows, you should probably install perl anyway; I use it all the time. Once you have found the URL for the data you want (in my case, “http://esa.un.org/wpp/excel-Data/DB02_Stock_Indicators/WPP2010_DB2_F01_TOTAL_POPULATION_BOTH_SEXES.XLS”, which lists the populations of countries around the world from 1950 to 2010), you’ll probably have to look for which rows and columns contain the correct data. In the following example, the header row is the first row to contain the word “Index”.

require(gdata)
URL <- "http://esa.un.org/wpp/excel-Data/DB02_Stock_Indicators/WPP2010_DB2_F01_TOTAL_POPULATION_BOTH_SEXES.XLS"
uniqueHeaderText <- "Index"
WorldPop <- read.xls(URL, stringsAsFactors=FALSE, pattern=uniqueHeaderText, header=TRUE)
# the numbers have a space as the thousands separator. Remove these and convert columns to numeric where possible
WorldPop <- data.frame(lapply(WorldPop, function(x) {tryCatch(as.numeric(gsub(' ','',x)), warning=function(e) x)}), stringsAsFactors=FALSE)

After that, it’s relatively easy to produce plots, for example, the plot above of the 10 most populous countries in 2010:

#require(cairoDevice); Cairo_svg("pop.svg", width=4.8, height=4.8) #uncomment to output an svg file, remember to dev.off() at the end
CountryPopns <- subset(WorldPop, Country.code < 900)
Top10 <- tail(CountryPopns[order(CountryPopns$X2010),], 10)
lab <- sub('United States of America', 'USA', Top10$Major.area..region..country.or.area)
par(mar=c(5,8,2,1),mgp=c(2,1,0), las=1, col.sub='grey', cex.sub=0.8)
barplot(Top10$X2010/1000, names.arg = lab, horiz=TRUE, cex.names=0.8, main="Most populous countries", sub="(data from UN WPP)", xlab="Population (millions) in 2010")
#dev.off()

Getting data from Google spreadsheets

Currently, Google spreadsheets, such as those provided by the Guardian, are accessed via a URL such as the following: https://docs.google.com/spreadsheet/ccc?key=0AhORuxOwZhGydFlsT3VlamppWmJHSzFCeG1JRWJyamc. If you have a look at this particular one, you’ll find it contains data culled from a paper in The Lancet as part of the Global Burden of Disease 2010 initiative. There are a number of ways to tap into these Google spreadsheets from within R:

  1. Access the URL directly. This needs a link to the data in plain text (e.g. csv) format. If you don’t own the spreadsheet, you won’t be able to change its publishing settings (as per the instructions here), so finding a plain-text link is less obvious: at the time of writing, it can be done by appending “&output=csv” to the Google URL. That gives a link that can be used in a web browser. However, there are still two issues when trying to use such a link in R:
    1. The read.csv method in R can’t access files over secure https:// connections, which is what Google spreadsheets now use. That means using something like RCurl to access the file.
    2. The actual URL gets redirected by Google, and does some funny shenanigans with cookies. This can be worked around by setting the RCurl option “followlocation” to TRUE, and “cookiefile” to some non-existent file.

    At the moment, the following works for me (R 2.12 on OS X):

    require(RCurl)
    URL <- "https://docs.google.com/spreadsheet/ccc?key=0AhORuxOwZhGydFlsT3VlamppWmJHSzFCeG1JRWJyamc"
    myCsv <- getURL(paste(URL, "&output=csv", sep=""), .opts=curlOptions(followlocation=TRUE, cookiefile="nosuchfile"), ssl.verifypeer=FALSE)
    LifeExpectancies <- read.csv(textConnection(myCsv), stringsAsFactors=FALSE)

    But although this works for now, it’s sensitive to how Google implements access to its public spreadsheets. The next method wraps all that complexity up (i.e. someone else has to deal with maintaining compatibility!).

  2. Use RGoogleDocs. This R package also requires RCurl to have been installed, but gives a slicker, higher-level interface. I installed the package by typing install.packages("RGoogleDocs", repos = "http://www.omegahat.org/R", type="source") within R. On Windows, this will probably require you to install tools for compiling stuff.
    Then you can access public Google spreadsheets by doing something like:
require(RGoogleDocs)
key <- "0AhORuxOwZhGydFlsT3VlamppWmJHSzFCeG1JRWJyamc"
ws <- getWorksheets(publicWorksheet(key), NULL)
LifeExpectancies <- sheetAsMatrix(ws[[1]], header=TRUE, con = NULL, stringsAsFactors=FALSE)
  3. Use RGoogleData. This is another package, which doesn’t access the data as csv text, but uses the Java API provided by Google. That seems like a less hacky solution to me, but the package doesn’t appear to have been updated recently. Worse, it no longer builds, which I think is why I can’t install it from R-Forge. At any rate, the following fails on my computer:
    install.packages("RGoogleData", repos="http://R-Forge.R-project.org", type="source")

    Moreover, I can’t find any reference to it in the available packages:

    pkgs <- available.packages(contriburl=contrib.url("http://r-forge.r-project.org", "source"))
    grep('goog', pkgs[,'Package'], ignore.case = TRUE, value=TRUE)
    ## named character(0)

    Shame. Perhaps it will be updated in the future.

Whatever means you use to obtain the data, once you have it, it should be easy to plot out informative patterns – for example, the differences in life expectancy between males and females (as with the UN data, I’ve found the “reshape” package handy to re-jig the layout):

require(reshape)
# Make spreadsheet column names easier to deal with
colnames(LifeExpectancies) <- make.names(colnames(LifeExpectancies))
LE <- melt(LifeExpectancies, id=c("Country", "Year", "Male.Female"))
#just select data for 2010
LE2010 <- cast(LE, Country ~ variable+Male.Female | Year)[['2010']]
LE2010$MFdiff <- LE2010$Life.expectancy_Female - LE2010$Life.expectancy_Male
Worst10 <- head(LE2010[order(LE2010$MFdiff),], 10)
#require(cairoDevice); Cairo_svg("MFdiff.svg", width=4.8, height=3) #uncomment to output an svg file, remember to dev.off() at the end
par(mar=c(4,2,3,1),mgp=c(2,1,0), las=1, col.sub='grey', cex.sub=0.8)
barplot(Worst10$MFdiff, horiz=TRUE, space=0.2, col="magenta", border=NA, main='Sex differences in life expectancy at\nbirth (bottom 10 countries, 2010)', sub='[data from GBD 2010: Lancet (2012) 380:2071]', xlab='Female minus Male LE (years)', xlim=c(-2,4))
for(i in seq(nrow(Worst10))) text(if(Worst10$MFdiff[i] < 0) 0.1 else -0.1, i*(1+0.2)-0.5, strsplit(Worst10$Country[i],',')[[1]][1], adj=if(Worst10$MFdiff[i] < 0) 0 else 1, cex=0.7)

SVG output from the code above

Which produces the plot shown here. There’s one major thing missing from these plots: any indication of how accurate the figures are. In the original Lancet paper, one of the major advances was providing confidence intervals for these estimates. Unfortunately, the Guardian have not included these figures in their summary spreadsheet, so you’ll have to go to the original paper to get them, which requires logging in. I’m slightly disappointed that the Lancet don’t make this sort of dataset publicly accessible for automated analysis.

Combining data from multiple sources

SVG output from the R code below

My original intention was to create something like the plot to the right, which is similar to one on Wikipedia at the time of writing. That requires using data from multiple online sources, which means matching the datasets up: not necessarily an easy task when one dataset might list a country as “TFYR of Macedonia” whereas another calls the same country “Macedonia, the Former Yugoslav Republic of”. For that reason, it is useful to use official codes for entities such as countries. These codes are present in the R package ISOcodes, which also (helpfully) gives the M.49 region codes used by the UN, which (with a bit of jiggery pokery) can be used to add a splash of colour to the plot.

Anyway, here’s the R code which will download the data and create this plot for you. I’ve wrapped up the Google Spreadsheet accessing code in a function which preferentially uses RGoogleDocs, if you have it installed. What you definitely need installed are the packages RCurl, ISOcodes, gdata, and reshape.

getPublicGoogleSpreadsheet <- function(key, na.strings = "NA", header = TRUE, check.names = TRUE, ...) {
 if (suppressWarnings(require('RGoogleDocs', character.only=TRUE, quietly=TRUE))) {
  ws <- getWorksheets(publicWorksheet(key), NULL)
  retval <- sheetAsMatrix(ws[[1]], con = NULL, header = header, ...)
  if (check.names) names(retval) <- make.names(names(retval))
  return(retval)
 } else {
  require(RCurl) #a hackier method & not always guaranteed to work
  URL <- paste("https://docs.google.com/spreadsheet/ccc?key=", key, "&output=csv", sep="")
  myCsv <- getURL(URL, .opts=curlOptions(followlocation=TRUE, cookiefile="nosuchfile"), ssl.verifypeer=FALSE)
  read.csv(textConnection(myCsv), na.strings=na.strings, header = header, ...)
 }
}

LifeExp <- getPublicGoogleSpreadsheet( "0AhORuxOwZhGydFlsT3VlamppWmJHSzFCeG1JRWJyamc", stringsAsFactors=FALSE, na.strings="")

gCodes <- getPublicGoogleSpreadsheet( "0AonYZs4MzlZbdG9ZNmNEVzRZeUVGM2g3eHVDSVNsTnc", stringsAsFactors=FALSE, na.strings="") #2 letter country codes from the Guardian
LifeExp$Country2Letter <- gCodes$Code[match(tolower(LifeExp$Country), tolower(gCodes$Country))]
# Column 3 in the Guardian Country Codes sheets gives alternative names
gCodes[match("côte d'ivoire", tolower(gCodes$Country)),3] <- "Cote d'Ivoire" # add some corrections
gCodes[match("palestinian territory, occupied", tolower(gCodes$Country)),3] <- "Occupied Palestinian Territory"
gCodes[pmatch("moldova", tolower(gCodes$Country)),3] <- "Moldova"
gCodes[pmatch("taiwan", tolower(gCodes$Country)),3] <- "Taiwan"
unmatched <- is.na(LifeExp$Country2Letter)
LifeExp$Country2Letter[unmatched] <- gCodes$Code[match(tolower(LifeExp$Country[unmatched]), tolower(gCodes[,3]))]

require(ISOcodes) #Use ISO 3166 to convert 2-letter -> 3-digit (UN) codes
data('ISO_3166_1')
LifeExp$Country3Digit <- ISO_3166_1$Numeric[match(LifeExp$Country2Letter, ISO_3166_1$Alpha_2)]
LifeExp$Country3Digit[LifeExp$Country=='Sudan'] <- 736 # Use 2010 code for Sudan (changed to '729' in 2011)

#plot approx as in wikipedia
require(gdata) #to get population sizes
WorldPop <- read.xls("http://esa.un.org/wpp/excel-Data/DB02_Stock_Indicators/WPP2010_DB2_F01_TOTAL_POPULATION_BOTH_SEXES.XLS", stringsAsFactors=FALSE, pattern="Index", header=TRUE)
WorldPop <- data.frame(lapply(WorldPop, function(x) {tryCatch(as.numeric(gsub(' ','',x)), warning=function(e) x)}), stringsAsFactors=FALSE) # Remove spaces
require(reshape)
LE <- melt(LifeExp, id=c("Country", "Country3Digit", "Country2Letter", "Year", "Male.Female"))
LE2010 <- cast(LE, Country + Country2Letter + Country3Digit ~ variable+Male.Female | Year)[['2010']]
LE2010$Pop1000s <- WorldPop$X2010[match(as.numeric(LE2010$Country3Digit), WorldPop$Country.code)]

data(UN_M.49_Regions)
regions <- subset(UN_M.49_Regions, Parent=='001' & Code <= 150)
find.countries <- function(x) {r <- UN_M.49_Regions[UN_M.49_Regions$Code==x, 'Children']; if(length(r)) sapply(strsplit(r, ', ')[[1]], find.countries) else x}
regions$Countries <- sapply(regions$Code, function(x) paste(unlist(find.countries(x))))
LE2010$Region <- sapply(LE2010$Country3Digit, function(x) regions$Name[grep(x, regions$Countries)[1]])
LE2010$Region[LE2010$Country=='Taiwan'] <- 'Asia'
LE2010$Region[LE2010$Country=='Sudan'] <- 'Africa'
LE2010 <- LE2010[order(LE2010$Pop1000s, decreasing=TRUE),] #overlay larger with smaller 
region.colours <- c(Africa='yellow', Americas='magenta', Asia='red', Europe='green', Oceania='cyan') 
#require(cairoDevice); Cairo_svg("MFplot.svg", width=6.5, height=6.5) 
par(mgp=c(2.5,1,0),col.sub='grey', cex.sub=0.75)
plot(Life.expectancy_Female~Life.expectancy_Male, data=LE2010, pch=21, cex=(LE2010$Pop1000s*1000)^(1/3)/200, main=paste("Sex differences in lifespan from", nrow(LE2010), "countries"), sub="(Life expectancies from GBD 2010 [Lancet 380:2071]. Population sizes from UN WPP)", xlab='Male Life Expectancy at Birth (years)', ylab="Female Life Expectancy at Birth (years)", xlim=c(30,90), ylim=c(30,90), bg=region.colours[LE2010$Region])
abline(a=0,b=1, col='green')
legend(80,55, xjust=0.5, names(region.colours), pt.bg=region.colours, pch=21, pt.cex=1.5, bty='n')
text(80, 33, 'Size of points proportional to\ncube root of population size', cex=0.7)
labelshow <- c('Haiti', 'Japan', 'Afghanistan', 'Jordan')
text(LE2010[match(labelshow, LE2010$Country), c('Life.expectancy_Male', 'Life.expectancy_Female')]+rep(c(1.3,0), each=length(labelshow)), labelshow, adj=c(0,0.5), cex=0.7)

Summary

There’s more R code in this post than I intended. But basically, the options are:

  1. Data stored in http:// links
    1. Plain .csv files: use read.csv('http://etc')
    2. Zipped .csv files: simplest to download.file then look inside with unzip(list=TRUE) and finally use unz() to get the first file in the zip archive.
    3. Excel files: use read.xls from the gdata package. You might have to examine the file first to look for a word or two that indicates which row is the header.
  2. Data stored in https:// links
    1. Use RCurl, with something like read.csv(textConnection(getURL(***))). With the getURL function, you may need to provide .opts=curlOptions(followlocation=TRUE, cookiefile="nosuchfile") and ssl.verifypeer=FALSE
    2. For Google Spreadsheets, try RGoogleDocs, using something like sheetAsMatrix(getWorksheets(publicWorksheet(KEY), NULL)[[1]])

The reshape package can then be helpful in restructuring the data into the right form for plotting or analysis.
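For a simple long-to-wide pivot like the ones above, base R can also manage without the extra dependency; here’s a toy sketch using xtabs (the country names and values are made up for illustration):

```r
# Long format: one row per country/sex combination (invented numbers)
d <- data.frame(Country = c("A", "A", "B", "B"),
                Sex = c("Male", "Female", "Male", "Female"),
                Value = c(10, 11, 20, 22))
# Pivot to wide format: one row per country, one column per sex
wide <- as.data.frame.matrix(xtabs(Value ~ Country + Sex, data = d))
wide$MaleFrac <- wide$Male / (wide$Male + wide$Female)
```

Note that xtabs sums the values in each cell, so with several estimates per country (as in the UN data above) you’d get a total rather than the mean – one reason the cast function, with its fun.aggregate argument, is the more flexible tool.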
