Informative graphics: male and female lifespan

Plot from wikipedia of male and female life expectancy at birth from CIA factbook data. The apparent 3D volumes of the bubbles are linearly proportional to their population

Last month I recorded an update to my Radio 4 programme on why women live longer than men (which you can still listen to online). A cut down version was broadcast on the BBC World Service. For this we wanted to include statistics for different countries. I was busy investigating official UN sources for these data, when I noticed that morning’s papers were reporting a relevant set of studies published that day in the Lancet . A couple of thoughts followed. Firstly, these new data indicate that in 2010, men had a higher life expectancy than women in only 2 countries: Afghanistan and Jordan . That’s in contrast to the UN data, which finds this effect in a small number of African countries too – suggesting the UN data could be inaccurate in these casesaThe difference can be seen by plotting the bottom 10 countries from the UN WPP 2010 data

which can be seen here  (click on “Show all countries by year”, then select 2005-2010).

Equivalent data from GDB2010: produced from code in my previous blog post

Compare this list with the equivalent from the GBD2010 study, on the right.

Potential reasons for the difference are discussed in pages 25-27 of the appendix to the paper, which is available freely online. Secondly, the data are given with uncertainty intervals for the first time, which got me thinking about the best way to illustrate the data. In particular, I’ve been thinking about how to improve the plot above (taken from the wikipedia page listing the life expectancy of different countries).

I think the general idea of the plot is a good one. Plotting the male value on one axis versus the female value on the other shows both absolute life expectancies, and the global advantage females now have over males – testament to underlying biology and the major advances in maternal care over the past 200 years .

Similarly, colouring the points by region (Africa, Europe, etc.) seems a good way of revealing patterns in the data. The exact classification might be a matter of debate, but there is a standard UN allocation of counties to regions, as well as the slightly different set of regions and subregions adopted in the GBD 2010 data (e.g. see Appendix A here).

No, the main problem I have with this plot is the use of size of the plotting symbols. It’s a common technique to use the area of each point to indicate the size of the dataset on which each point is based. That slight convention is broken in the wikipedia plot, where the volume (rather than area) of the symbol indicates the population size of each country. That’s justified on the basis that the points are meant to look like spheres (they don’t to me, especially since it’s not otherwise a 3D plot). But it does mean that both China (population 1341 million) and Kiribati (population 0.1 million) will fit on the same plot: if Kiribati is a point 1 pixel in diameter, then China would be (1341/0.1)1/3 = 24 pixels in diameter. If we switch to the more conventional use of the area to represent population size, then China would have to be (1341/0.1)1/2 = 116 pixels – blotting out most of the plot.

Regardless if area or volume are used, I still think it is a bit misleading when the plot symbols differ so much in size. The problem is with the larger ones. Firstly, their size obscures exactly where the point lies. Secondly, they could be misleadingly seen as points with a bigger variability in their estimates. The implication might be that (say) China has some areas where males outlive females (part of its symbol goes below the green line) whereas Angola doesn’t. That’s not true. In fact, the symbols are almost the opposite of a confidence ellipse: the countries with the largest populations are those which are likely to have the most precise estimates. Females almost certainly outlive males on average in China – lots of data there. In contrast, the data for Botswana (a small circle, definitely below the green line on the plot) is probably uncertain enough that the real point could be above the line. The recent GBD data includes uncertainty intervals, so uncertainties could in fact be incorporated somehow.

So why might it be a good idea to show more populous countries with bigger circles? I think it’s because the bigger circles are more visible – the size gives them an importance which is probably justified, given their larger population. Are there any ways to show this importance other than by using size? Well, one possibility is to use opacity: larger countries could appear solid, while smaller ones more translucent.

Life expectancy from 187 countries in 2010. Transparency of points is proportional to a country’s population size in 2010.

So to the right is my initial take on the plot, using the new GBD dataset, which is accessible from the Guardian data store, and can be accessed as detailed in my previous blog post.

Of course, there’s the issue of how exactly to chose opacity values that “correctly” reflect the weight of each point. You have to bear in mind that most computers use a scale which only allows 255 values for opacity, and the most translucent colours are unlikely to be visible with a simple glance. It’s the same problem of trying to find a scale that adequately shows the population size of both Kiribati and China. As a first pass, I’ve simply chosen a linear scale (the alpha value of the colour is proportional to population size), but truncated the top end, so that any country with a population over 100 million is shown completely opaque (no translucency). There are probably better ways of doing this.

Life expectancy from 187 countries in 2010 (with 95% uncertainty intervals). Transparency of points is proportional to a country’s population size in 2010.

There’s an additional advantage that we have now freed up the size of the points to be used in a different way – for instance to show the error in our estimates. The plot to the left shows uncertainty intervals, as X and Y bars which have the same translucency as the original points.

In the spirit of reproducible analysis, the footnotes provide some rather involved R code to produce these two plots using freely available online databThe following code produces both plots described in this post – It’s rather a lot of code because much of it is aimed at syncing up the names of the countries between different datasets. The bits that do the plotting are shown with a yellow background

getPublicGoogleSpreadsheet <- function(key, na.strings = "NA", header = TRUE, check.names = TRUE, ...)
 if (suppressWarnings(require('RGoogleDocs', character.only=TRUE, quietly=TRUE))) {
  ws <- getWorksheets(publicWorksheet(key), NULL)
  retval <- sheetAsMatrix(ws[[1]], con = NULL, header = header, ...)
  if (check.names) names(retval) <- make.names(names(retval))
  return(retval)
 } else {
  require(RCurl) #a hackier method & not always guaranteed to work
  URL <- paste("https://docs.google.com/spreadsheet/ccc?key=", key, "&output=csv", sep="")
  myCsv <-getURL(URL, .opts=curlOptions(followlocation=TRUE, cookiefile="nosuchfile"), ssl.verifypeer=FALSE)
  read.csv(textConnection(myCsv), na.strings=na.strings, header = header, ...)
 }

require(ISOcodes) #Use ISO 3166 to turn 2 letter -< 3 digit (UN) codes
data('ISO_3166_1') #have to add a few common names
ISO_3166_1[pmatch("iran", tolower(ISO_3166_1$Name)),'Common_name'] <- "Iran"
ISO_3166_1[pmatch("united kingdom", tolower(ISO_3166_1$Name)),'Common_name'] <- "UK"
ISO_3166_1[pmatch("united states", tolower(ISO_3166_1$Name)),'Common_name'] <- "USA"
ISO_3166_1[pmatch("syria", tolower(ISO_3166_1$Name)),'Common_name'] <- "Syria"
ISO_3166_1[pmatch("viet nam", tolower(ISO_3166_1$Name)),'Common_name'] <- "Vietnam"
ISO_3166_1[pmatch("tanzania", tolower(ISO_3166_1$Name)),'Common_name'] <- "Tanzania"
ISO_3166_1[pmatch("russian federation", tolower(ISO_3166_1$Name)),'Common_name'] <- "Russia"
ISO_3166_1[pmatch("myanmar", tolower(ISO_3166_1$Name)),'Common_name'] <- "Burma"
ISO_3166_1[pmatch("brunei", tolower(ISO_3166_1$Name)),'Common_name'] <- "Brunei"
ISO_3166_1[pmatch("congo, the democratic republic", tolower(ISO_3166_1$Name)),'Common_name'] <- "DR Congo"
ISO_3166_1[pmatch("lao", tolower(ISO_3166_1$Name)),'Common_name'] <- "Laos"
ISO_3166_1[pmatch("macedonia", tolower(ISO_3166_1$Name)),'Common_name'] <- "Macedonia"
ISO_3166_1[pmatch("korea, democratic people's republic of", tolower(ISO_3166_1$Name)),'Common_name'] <- "North Korea"
ISO_3166_1[pmatch("korea, republic of", tolower(ISO_3166_1$Name)),'Common_name'] <- "South Korea"
ISO_3166_1[pmatch("palestinian territory", tolower(ISO_3166_1$Name)),'Common_name'] <- "Palestine"
ISO_3166_1[pmatch("bahamas", tolower(ISO_3166_1$Name)),'Common_name'] <- "The Bahamas"
ISO_3166_1[pmatch("gambia", tolower(ISO_3166_1$Name)),'Common_name'] <- "The Gambia"

require(gdata) #to get population sizes
WorldPop <- read.xls("http://esa.un.org/wpp/excel-Data/DB02_Stock_Indicators/WPP2010_DB2_F01_TOTAL_POPULATION_BOTH_SEXES.XLS", stringsAsFactors=FALSE, pattern="Index", header=TRUE)
WorldPop <- data.frame(lapply(WorldPop, function(x) {tryCatch(as.numeric(gsub(' ','',x)), warning=function(e) x)}), stringsAsFactors=FALSE) # Remove spaces
GBDregions <- list( # from GBD2010 papers
  Asia = c(96, 392, 702, 410, 51, 31, 268, 398, 417, 496, 762, 795, 860, 156, 408, 158, 4, 50, 64, 356,
    524, 586, 104, 116, 360, 418, 458, 462, 608, 144, 764, 626, 704), 
  Australasia = c(36, 554), 
  Caribbean = c(28, 52, 84, 192, 212, 214, 308, 328, 332, 388, 662, 670, 740, 44, 780), 
  Europe = c(8, 70, 100, 191, 203, 348, 807, 499, 616, 642, 688, 703, 705, 112, 233, 428, 440, 498, 
    643, 804, 20, 40, 56, 196, 208, 246, 250, 276, 300, 352, 372, 376, 380, 442, 470, 528, 578, 620,
    724, 752, 756, 826),
  'Latin America' = c(68, 218, 604, 170, 188, 222, 320, 340, 484, 558, 591, 862, 32, 152, 858, 76, 600),
  'North Africa and Middle East' = c(12, 48, 818, 364, 368, 400, 414, 422, 434, 504, 512, 275, 634, 682,
    760, 788, 792, 784, 887),
  'North America' = c(124, 840),
  Oceania = c(583, 242, 296, 584, 598, 882, 90, 776, 548),
  'sub-Saharan Africa' = c(24, 140, 180, 226, 266, 178, 108, 174, 262, 232, 231, 404, 450, 454, 480, 
    508, 646, 690, 706, 736, 834, 800, 894, 72, 426, 516, 710, 748, 716, 204, 854, 120, 132, 148, 384, 
    288, 324,  624, 430, 466, 478, 562, 566, 678, 686, 694, 270, 768))
region.colours <- c("Other"="grey", "Asia"= "red" , "Australasia"="magenta", "Caribbean"= "cyan3", 
                    "Europe"  ="brown", "Latin America"="green3", "North Africa and Middle East"="darkorange",
                    "North America"="black", "Oceania" =  "blue", "sub-Saharan Africa" = "purple")
cutoff <- function(x, cutoff, upper=TRUE) if(upper) ifelse(x>cutoff, cutoff,x) else ifelse(x<cutoff, cutoff,x)


#plot without error bars - use the Guardian Data
LifeExp <- getPublicGoogleSpreadsheet( "0AhORuxOwZhGydFlsT3VlamppWmJHSzFCeG1JRWJyamc", stringsAsFactors=FALSE, na.strings="")
gCodes <- getPublicGoogleSpreadsheet( "0AonYZs4MzlZbdG9ZNmNEVzRZeUVGM2g3eHVDSVNsTnc", stringsAsFactors=FALSE, na.strings="") #2 letter country codes from the Guardian
LifeExp$Country2Letter <- gCodes$Code[match(tolower(LifeExp$Country), tolower(gCodes$Country))]
# Column 3 in the Guardian Country Codes sheets gives alternative names
gCodes[match("côte d'ivoire", tolower(gCodes$Country)),3] <- "Cote d'Ivoire" # add some corrections
gCodes[match("palestinian territory, occupied", tolower(gCodes$Country)),3] <- "Occupied Palestinian Territory"
gCodes[pmatch("moldova", tolower(gCodes$Country)),3] <- "Moldova"
gCodes[pmatch("taiwan", tolower(gCodes$Country)),3] <- "Taiwan"
unmatched <- is.na(LifeExp$Country2Letter)
LifeExp$Country2Letter[unmatched] <- gCodes$Code[match(tolower(LifeExp$Country[unmatched]), tolower(gCodes[,3]))]
LifeExp$Country3Digit <- ISO_3166_1$Numeric[match(LifeExp$Country2Letter, ISO_3166_1$Alpha_2)]
LifeExp$Country3Digit[LifeExp$Country=='Sudan'] <- 736 # Use 2010 code for Sudan (changed to '729' in 2011)
require(reshape)
LE <- melt(LifeExp, id=c("Country", "Country3Digit", "Country2Letter", "Year", "Male.Female"))
LE2010 <- cast(LE, Country + Country2Letter + Country3Digit ~ variable+Male.Female | Year)[['2010']]
LE2010$Pop1000s <- WorldPop$X2010[match(as.numeric(LE2010$Country3Digit), WorldPop$Country.code)]
LE2010$Region <- as.character(stack(GBDregions)[match(as.numeric(LE2010$Country3Digit), stack(GBDregions)$value),'ind'])

#require(cairoDevice); Cairo_svg("MFplot.svg", width=5, height=5) par(mgp=c(2.5,1,0),col.sub='grey', cex.sub=0.75) LE2010$col <- rgb(t(col2rgb(region.colours)[,LE2010$Region]), alpha=cutoff(LE2010$Pop1000s/100000*255,255), maxColorValue =255) plot(Life.expectancy_Female~Life.expectancy_Male, data=LE2010, pch=19, main=paste("Sex differences in lifespan from", nrow(LE2010), "countries"), sub="(Life expectancies from GBD 2010 [Lancet 380:2071]. Population sizes from UN WPP)", xlab='Male Life Expectancy at Birth (years)', ylab="Female Life Expectancy at Birth (years)", xlim=c(30,90), ylim=c(30,90), col=LE2010$col) abline(a=0,b=1, col='green') legend(75,52, xjust=0.5, names(region.colours), col=region.colours, pch=19, pt.cex=1.5, bty='n', y.intersp=1.2) labelshow <- c('Haiti', 'Japan', 'Afghanistan', 'Jordan') text(LE2010[match(labelshow, LE2010$Country), c('Life.expectancy_Male', 'Life.expectancy_Female')]+rep(c(1.3,0), each=length(labelshow)), labelshow, adj=c(0,0.5), cex=0.7) #dev.off()
#Now the plot with error bars. Get data from wikipedia GBD listing, which has uncertainty intervals require(XML) TextTab <- readHTMLTable("http://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy", which=3) LifeExp <- data.frame(Name=as.character(TextTab[[1]]), Male = as.numeric(as.character(TextTab[[2]])), MaleLci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\1', TextTab[[3]])), MaleUci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\2', TextTab[[3]])), Female = as.numeric(as.character(TextTab[[4]])), FemaleLci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\1', TextTab[[5]])), FemaleUci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\2', TextTab[[5]])), stringsAsFactors=FALSE) #countries with accents get mangled - reset them LifeExp[grep("d'Ivoire$", LifeExp$Name), "Name"] <- "Côte d'Ivoire" LifeExp[grep("ncipe$", LifeExp$Name), "Name"] <- "Sao Tome and Principe" LifeExp$Country3Digit <- ISO_3166_1$Numeric[match(tolower(LifeExp$Name), tolower(ISO_3166_1$Official_name))] #match official name first (unlikely) LifeExp$Country3Digit <- ifelse(is.na(LifeExp$Country3Digit), ISO_3166_1$Numeric[match(tolower(LifeExp$Name), tolower(ISO_3166_1$Name))], LifeExp$Country3Digit) #match usual name next LifeExp$Country3Digit <- ifelse(is.na(LifeExp$Country3Digit), ISO_3166_1$Numeric[match(tolower(LifeExp$Name), tolower(ISO_3166_1$Common_name))], LifeExp$Country3Digit) LE2010 <- subset(LifeExp, !is.na(Country3Digit)) LE2010$Pop1000s <- WorldPop$X2010[match(as.numeric(LE2010$Country3Digit), WorldPop$Country.code)] LE2010$Region <- as.character(stack(GBDregions)[match(as.numeric(LE2010$Country3Digit), stack(GBDregions)$value),'ind']) LE2010$col <- rgb(t(col2rgb(region.colours)[,LE2010$Region]), alpha=cutoff(LE2010$Pop1000s/100000*255,255), maxColorValue =255)
#require(cairoDevice); Cairo_svg("MFwithErrors.svg", width=5, height=5) par(mar=c(5,3,1,0.7), mgp=c(2.5,1,0), col.sub='grey', cex.sub=0.75) plot(LE2010$Male, LE2010$Female, xlim=c(28,90), ylim=c(28,90), xlab="Male Life Expectancy at Birth (years)", , sub="(Life expectancies from GBD 2010 [Lancet 380:2071]. Population sizes from UN WPP)", ylab="", tck=0.01, xaxs='i', yaxs='i', yaxt="n", bty="l", col=LE2010$col, pch=19) abline(a=0,b=1, col="green", lty=3) legend(75,50, xjust=0.5, names(region.colours), col=region.colours, pch=19, pt.cex=1.5, bty='n', y.intersp=1.2) axis(2, mgp=c(0,0.3,0), tck=0.01) title(ylab="Female Life Expectancy at Birth (years)", line=1.7) arrows(LE2010$MaleLci, LE2010$Female, LE2010$MaleUci, LE2010$Female, angle=90, length=0.01, col=LE2010$col, code=3) arrows(LE2010$Male, LE2010$FemaleLci, LE2010$Male, LE2010$FemaleUci, angle=90, length=0.01, col=LE2010$col, code=3) points(LE2010$Male, LE2010$Female, pch=16, cex=0.1) showC <- c("Haiti", "China", "USA", "Afghanistan", "Nigeria", "India", "Canada", "UK", "Ethiopia", "Jordan", "Egypt", "Central African Republic") idC <- match(showC, LE2010$Name) text(LE2010$Male[idC]+0.3, LE2010$Female[idC]-1, LE2010$Name[idC], col=region.colours[LE2010$Region[idC]], pos=4, offset=0, cex=0.55) showC <- c("Brazil", "Russia", "Japan", "South Korea", "Vietnam", "Indonesia") idC <- match(showC, LE2010$Name) text(LE2010$Male[idC]-0.4, LE2010$Female[idC]+0.5, LE2010$Name[idC], col=region.colours[LE2010$Region[idC]], pos=2, offset=0, cex=0.55) #dev.off()

.

References

Notes   [ + ]

a. The difference can be seen by plotting the bottom 10 countries from the UN WPP 2010 data

which can be seen here  (click on “Show all countries by year”, then select 2005-2010).

Equivalent data from GDB2010: produced from code in my previous blog post

Compare this list with the equivalent from the GBD2010 study, on the right.

Potential reasons for the difference are discussed in pages 25-27 of the appendix to the paper, which is available freely online

b. The following code produces both plots described in this post – It’s rather a lot of code because much of it is aimed at syncing up the names of the countries between different datasets. The bits that do the plotting are shown with a yellow background

getPublicGoogleSpreadsheet <- function(key, na.strings = "NA", header = TRUE, check.names = TRUE, ...)
 if (suppressWarnings(require('RGoogleDocs', character.only=TRUE, quietly=TRUE))) {
  ws <- getWorksheets(publicWorksheet(key), NULL)
  retval <- sheetAsMatrix(ws[[1]], con = NULL, header = header, ...)
  if (check.names) names(retval) <- make.names(names(retval))
  return(retval)
 } else {
  require(RCurl) #a hackier method & not always guaranteed to work
  URL <- paste("https://docs.google.com/spreadsheet/ccc?key=", key, "&output=csv", sep="")
  myCsv <-getURL(URL, .opts=curlOptions(followlocation=TRUE, cookiefile="nosuchfile"), ssl.verifypeer=FALSE)
  read.csv(textConnection(myCsv), na.strings=na.strings, header = header, ...)
 }

require(ISOcodes) #Use ISO 3166 to turn 2 letter -< 3 digit (UN) codes
data('ISO_3166_1') #have to add a few common names
ISO_3166_1[pmatch("iran", tolower(ISO_3166_1$Name)),'Common_name'] <- "Iran"
ISO_3166_1[pmatch("united kingdom", tolower(ISO_3166_1$Name)),'Common_name'] <- "UK"
ISO_3166_1[pmatch("united states", tolower(ISO_3166_1$Name)),'Common_name'] <- "USA"
ISO_3166_1[pmatch("syria", tolower(ISO_3166_1$Name)),'Common_name'] <- "Syria"
ISO_3166_1[pmatch("viet nam", tolower(ISO_3166_1$Name)),'Common_name'] <- "Vietnam"
ISO_3166_1[pmatch("tanzania", tolower(ISO_3166_1$Name)),'Common_name'] <- "Tanzania"
ISO_3166_1[pmatch("russian federation", tolower(ISO_3166_1$Name)),'Common_name'] <- "Russia"
ISO_3166_1[pmatch("myanmar", tolower(ISO_3166_1$Name)),'Common_name'] <- "Burma"
ISO_3166_1[pmatch("brunei", tolower(ISO_3166_1$Name)),'Common_name'] <- "Brunei"
ISO_3166_1[pmatch("congo, the democratic republic", tolower(ISO_3166_1$Name)),'Common_name'] <- "DR Congo"
ISO_3166_1[pmatch("lao", tolower(ISO_3166_1$Name)),'Common_name'] <- "Laos"
ISO_3166_1[pmatch("macedonia", tolower(ISO_3166_1$Name)),'Common_name'] <- "Macedonia"
ISO_3166_1[pmatch("korea, democratic people's republic of", tolower(ISO_3166_1$Name)),'Common_name'] <- "North Korea"
ISO_3166_1[pmatch("korea, republic of", tolower(ISO_3166_1$Name)),'Common_name'] <- "South Korea"
ISO_3166_1[pmatch("palestinian territory", tolower(ISO_3166_1$Name)),'Common_name'] <- "Palestine"
ISO_3166_1[pmatch("bahamas", tolower(ISO_3166_1$Name)),'Common_name'] <- "The Bahamas"
ISO_3166_1[pmatch("gambia", tolower(ISO_3166_1$Name)),'Common_name'] <- "The Gambia"

require(gdata) #to get population sizes
WorldPop <- read.xls("http://esa.un.org/wpp/excel-Data/DB02_Stock_Indicators/WPP2010_DB2_F01_TOTAL_POPULATION_BOTH_SEXES.XLS", stringsAsFactors=FALSE, pattern="Index", header=TRUE)
WorldPop <- data.frame(lapply(WorldPop, function(x) {tryCatch(as.numeric(gsub(' ','',x)), warning=function(e) x)}), stringsAsFactors=FALSE) # Remove spaces
GBDregions <- list( # from GBD2010 papers
  Asia = c(96, 392, 702, 410, 51, 31, 268, 398, 417, 496, 762, 795, 860, 156, 408, 158, 4, 50, 64, 356,
    524, 586, 104, 116, 360, 418, 458, 462, 608, 144, 764, 626, 704), 
  Australasia = c(36, 554), 
  Caribbean = c(28, 52, 84, 192, 212, 214, 308, 328, 332, 388, 662, 670, 740, 44, 780), 
  Europe = c(8, 70, 100, 191, 203, 348, 807, 499, 616, 642, 688, 703, 705, 112, 233, 428, 440, 498, 
    643, 804, 20, 40, 56, 196, 208, 246, 250, 276, 300, 352, 372, 376, 380, 442, 470, 528, 578, 620,
    724, 752, 756, 826),
  'Latin America' = c(68, 218, 604, 170, 188, 222, 320, 340, 484, 558, 591, 862, 32, 152, 858, 76, 600),
  'North Africa and Middle East' = c(12, 48, 818, 364, 368, 400, 414, 422, 434, 504, 512, 275, 634, 682,
    760, 788, 792, 784, 887),
  'North America' = c(124, 840),
  Oceania = c(583, 242, 296, 584, 598, 882, 90, 776, 548),
  'sub-Saharan Africa' = c(24, 140, 180, 226, 266, 178, 108, 174, 262, 232, 231, 404, 450, 454, 480, 
    508, 646, 690, 706, 736, 834, 800, 894, 72, 426, 516, 710, 748, 716, 204, 854, 120, 132, 148, 384, 
    288, 324,  624, 430, 466, 478, 562, 566, 678, 686, 694, 270, 768))
region.colours <- c("Other"="grey", "Asia"= "red" , "Australasia"="magenta", "Caribbean"= "cyan3", 
                    "Europe"  ="brown", "Latin America"="green3", "North Africa and Middle East"="darkorange",
                    "North America"="black", "Oceania" =  "blue", "sub-Saharan Africa" = "purple")
cutoff <- function(x, cutoff, upper=TRUE) if(upper) ifelse(x>cutoff, cutoff,x) else ifelse(x<cutoff, cutoff,x)


#plot without error bars - use the Guardian Data
LifeExp <- getPublicGoogleSpreadsheet( "0AhORuxOwZhGydFlsT3VlamppWmJHSzFCeG1JRWJyamc", stringsAsFactors=FALSE, na.strings="")
gCodes <- getPublicGoogleSpreadsheet( "0AonYZs4MzlZbdG9ZNmNEVzRZeUVGM2g3eHVDSVNsTnc", stringsAsFactors=FALSE, na.strings="") #2 letter country codes from the Guardian
LifeExp$Country2Letter <- gCodes$Code[match(tolower(LifeExp$Country), tolower(gCodes$Country))]
# Column 3 in the Guardian Country Codes sheets gives alternative names
gCodes[match("côte d'ivoire", tolower(gCodes$Country)),3] <- "Cote d'Ivoire" # add some corrections
gCodes[match("palestinian territory, occupied", tolower(gCodes$Country)),3] <- "Occupied Palestinian Territory"
gCodes[pmatch("moldova", tolower(gCodes$Country)),3] <- "Moldova"
gCodes[pmatch("taiwan", tolower(gCodes$Country)),3] <- "Taiwan"
unmatched <- is.na(LifeExp$Country2Letter)
LifeExp$Country2Letter[unmatched] <- gCodes$Code[match(tolower(LifeExp$Country[unmatched]), tolower(gCodes[,3]))]
LifeExp$Country3Digit <- ISO_3166_1$Numeric[match(LifeExp$Country2Letter, ISO_3166_1$Alpha_2)]
LifeExp$Country3Digit[LifeExp$Country=='Sudan'] <- 736 # Use 2010 code for Sudan (changed to '729' in 2011)
require(reshape)
LE <- melt(LifeExp, id=c("Country", "Country3Digit", "Country2Letter", "Year", "Male.Female"))
LE2010 <- cast(LE, Country + Country2Letter + Country3Digit ~ variable+Male.Female | Year)[['2010']]
LE2010$Pop1000s <- WorldPop$X2010[match(as.numeric(LE2010$Country3Digit), WorldPop$Country.code)]
LE2010$Region <- as.character(stack(GBDregions)[match(as.numeric(LE2010$Country3Digit), stack(GBDregions)$value),'ind'])

#require(cairoDevice); Cairo_svg("MFplot.svg", width=5, height=5) par(mgp=c(2.5,1,0),col.sub='grey', cex.sub=0.75) LE2010$col <- rgb(t(col2rgb(region.colours)[,LE2010$Region]), alpha=cutoff(LE2010$Pop1000s/100000*255,255), maxColorValue =255) plot(Life.expectancy_Female~Life.expectancy_Male, data=LE2010, pch=19, main=paste("Sex differences in lifespan from", nrow(LE2010), "countries"), sub="(Life expectancies from GBD 2010 [Lancet 380:2071]. Population sizes from UN WPP)", xlab='Male Life Expectancy at Birth (years)', ylab="Female Life Expectancy at Birth (years)", xlim=c(30,90), ylim=c(30,90), col=LE2010$col) abline(a=0,b=1, col='green') legend(75,52, xjust=0.5, names(region.colours), col=region.colours, pch=19, pt.cex=1.5, bty='n', y.intersp=1.2) labelshow <- c('Haiti', 'Japan', 'Afghanistan', 'Jordan') text(LE2010[match(labelshow, LE2010$Country), c('Life.expectancy_Male', 'Life.expectancy_Female')]+rep(c(1.3,0), each=length(labelshow)), labelshow, adj=c(0,0.5), cex=0.7) #dev.off()
#Now the plot with error bars. Get data from wikipedia GBD listing, which has uncertainty intervals require(XML) TextTab <- readHTMLTable("http://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy", which=3) LifeExp <- data.frame(Name=as.character(TextTab[[1]]), Male = as.numeric(as.character(TextTab[[2]])), MaleLci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\1', TextTab[[3]])), MaleUci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\2', TextTab[[3]])), Female = as.numeric(as.character(TextTab[[4]])), FemaleLci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\1', TextTab[[5]])), FemaleUci = as.numeric(sub('\\(([0-9.]+)-([0-9.]+)\\)', '\\2', TextTab[[5]])), stringsAsFactors=FALSE) #countries with accents get mangled - reset them LifeExp[grep("d'Ivoire$", LifeExp$Name), "Name"] <- "Côte d'Ivoire" LifeExp[grep("ncipe$", LifeExp$Name), "Name"] <- "Sao Tome and Principe" LifeExp$Country3Digit <- ISO_3166_1$Numeric[match(tolower(LifeExp$Name), tolower(ISO_3166_1$Official_name))] #match official name first (unlikely) LifeExp$Country3Digit <- ifelse(is.na(LifeExp$Country3Digit), ISO_3166_1$Numeric[match(tolower(LifeExp$Name), tolower(ISO_3166_1$Name))], LifeExp$Country3Digit) #match usual name next LifeExp$Country3Digit <- ifelse(is.na(LifeExp$Country3Digit), ISO_3166_1$Numeric[match(tolower(LifeExp$Name), tolower(ISO_3166_1$Common_name))], LifeExp$Country3Digit) LE2010 <- subset(LifeExp, !is.na(Country3Digit)) LE2010$Pop1000s <- WorldPop$X2010[match(as.numeric(LE2010$Country3Digit), WorldPop$Country.code)] LE2010$Region <- as.character(stack(GBDregions)[match(as.numeric(LE2010$Country3Digit), stack(GBDregions)$value),'ind']) LE2010$col <- rgb(t(col2rgb(region.colours)[,LE2010$Region]), alpha=cutoff(LE2010$Pop1000s/100000*255,255), maxColorValue =255)
#require(cairoDevice); Cairo_svg("MFwithErrors.svg", width=5, height=5) par(mar=c(5,3,1,0.7), mgp=c(2.5,1,0), col.sub='grey', cex.sub=0.75) plot(LE2010$Male, LE2010$Female, xlim=c(28,90), ylim=c(28,90), xlab="Male Life Expectancy at Birth (years)", , sub="(Life expectancies from GBD 2010 [Lancet 380:2071]. Population sizes from UN WPP)", ylab="", tck=0.01, xaxs='i', yaxs='i', yaxt="n", bty="l", col=LE2010$col, pch=19) abline(a=0,b=1, col="green", lty=3) legend(75,50, xjust=0.5, names(region.colours), col=region.colours, pch=19, pt.cex=1.5, bty='n', y.intersp=1.2) axis(2, mgp=c(0,0.3,0), tck=0.01) title(ylab="Female Life Expectancy at Birth (years)", line=1.7) arrows(LE2010$MaleLci, LE2010$Female, LE2010$MaleUci, LE2010$Female, angle=90, length=0.01, col=LE2010$col, code=3) arrows(LE2010$Male, LE2010$FemaleLci, LE2010$Male, LE2010$FemaleUci, angle=90, length=0.01, col=LE2010$col, code=3) points(LE2010$Male, LE2010$Female, pch=16, cex=0.1) showC <- c("Haiti", "China", "USA", "Afghanistan", "Nigeria", "India", "Canada", "UK", "Ethiopia", "Jordan", "Egypt", "Central African Republic") idC <- match(showC, LE2010$Name) text(LE2010$Male[idC]+0.3, LE2010$Female[idC]-1, LE2010$Name[idC], col=region.colours[LE2010$Region[idC]], pos=4, offset=0, cex=0.55) showC <- c("Brazil", "Russia", "Japan", "South Korea", "Vietnam", "Indonesia") idC <- match(showC, LE2010$Name) text(LE2010$Male[idC]-0.4, LE2010$Female[idC]+0.5, LE2010$Name[idC], col=region.colours[LE2010$Region[idC]], pos=2, offset=0, cex=0.55) #dev.off()

Leave a Reply

Your email address will not be published. Required fields are marked *