Data wrangling with R (3): EU election turnout

Background

This is mainly an example of importing and working with JSON (JavaScript Object Notation) data in R, with a little tidyverse stuff (merging and reshaping data frames) thrown in.

The data

The data are turnout data for the European Parliament elections, and are available here. I’m interested in these data for an elicitation problem set by Good Judgement Open (which I’ll write about later), to do with predicting the turnout for 2019.

JSON data

Here’s an example of some JSON data (a section of the JSON file for turnout for 2014):

{
    "ballot": {
        "year": 2014,
        "type": "eu-wide"
    },
    "update": {
        "labelId": null,
        "resultStatus": "constitutive",
        "time": null
    },
    "results": [
        {
            "divisionId": "EU",
            "percent": "42.61"
        },
        {
            "divisionId": "BE",
            "percent": "89.64"
        },
        {
            "divisionId": "DK",
            "percent": "56.32"
        },

You can read more about the JSON data format here. It’s easy to import JSON files into R with the jsonlite package:

JsonData <- jsonlite::fromJSON("http://www.europarl.europa.eu/website/election-results/data/turnout/year/2014.json")
str(JsonData)
## List of 3
##  $ ballot :List of 2
##   ..$ year: int 2014
##   ..$ type: chr "eu-wide"
##  $ update :List of 3
##   ..$ labelId     : NULL
##   ..$ resultStatus: chr "constitutive"
##   ..$ time        : NULL
##  $ results:'data.frame': 29 obs. of  2 variables:
##   ..$ divisionId: chr [1:29] "EU" "BE" "DK" "DE" ...
##   ..$ percent   : chr [1:29] "42.61" "89.64" "56.32" "48.10" ...

From the str() output, we can see that the file has been imported as a list. The data I want are the turnouts for each country, and we can see that these are given in JsonData$results, conveniently already as a data frame.

head(JsonData$results)
##   divisionId percent
## 1         EU   42.61
## 2         BE   89.64
## 3         DK   56.32
## 4         DE   48.10
## 5         IE   52.44
## 6         FR   42.43

I’ll use a loop to load in the data for all the elections and join them together, manually adding a year column.

library(tidyverse)
electionUrl <- "http://www.europarl.europa.eu/website/election-results/data/turnout/year/"
allyears <- 1979 + 5*(0:7)
turnout <- NULL
for(electionyear in allyears){
  JsonData <- jsonlite::fromJSON(paste0(electionUrl,
                                        electionyear,
                                        ".json" )
                                 )
  JsonData$results %>% 
    mutate(year = electionyear) %>%
    bind_rows(turnout) ->
    turnout
}
turnout <- mutate(turnout, percent = as.numeric(percent))
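
As an aside, the same data frame can be built without an explicit loop. Here’s a sketch using purrr::map_dfr(), assuming the same electionUrl and allyears as above (the row order may differ from the loop version):

# Sketch: build the turnout data frame with purrr instead of a loop
turnout2 <- purrr::map_dfr(allyears, function(electionyear) {
  jsonlite::fromJSON(paste0(electionUrl, electionyear, ".json"))$results %>%
    mutate(year = electionyear)
}) %>%
  mutate(percent = as.numeric(percent))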

Adding in relative population sizes

I also want the relative sizes of the electorates in each country. I found some data at Eurostat (population as a percentage of the EU28 population). I used the 2018 values and manually added the two-letter country codes. The data give the percentage of the total population rather than the percentage of the total electorate, but hopefully the error will be small (and I don’t actually know how the ‘overall’ turnout was calculated).

I did the editing within Excel, and I’ve put the file on my website (I don’t think read_excel() currently works with URLs).

population <- readxl::read_excel("population.xlsx")
head(population)
## # A tibble: 6 x 3
##   divisionId country  proportion
##   <chr>      <chr>         <dbl>
## 1 EU         EU            100  
## 2 BE         Belgium         2.2
## 3 BG         Bulgaria        1.4
## 4 CZ         Czechia         2.1
## 5 DK         Denmark         1.1
## 6 DE         Germany        16.2
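
If you did want to read the file straight from a URL, one workaround is to download it to a temporary file first. A quick sketch (the URL here is just a placeholder, not the real location of the file):

# Sketch: download to a temporary file, then read with read_excel()
# (placeholder URL)
tf <- tempfile(fileext = ".xlsx")
download.file("https://example.com/population.xlsx", tf, mode = "wb")
population <- readxl::read_excel(tf)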

Now I can join the two data frames together.

turnout %>% 
  inner_join(population, by = "divisionId")  %>%
  mutate(proportion = proportion / 100) ->
  elections

I’ll do a quick check that weighting by population size gives the correct overall turnout:

election2014 <-  filter(elections, year == 2014, country != "EU") 
sum(election2014$percent * election2014$proportion)
## [1] 42.62418
filter(elections, year == 2014, country == "EU") %>%
  pull(percent)
## [1] 42.61

Close enough…

Before plotting, I’ll calculate a ‘prediction’ for 2019, assuming the UK doesn’t vote (although who knows what will happen…), turnout in all other countries is unchanged from 2014, and the weighting hasn’t changed.

elections %>%
  filter(year == 2014, country != "EU", country != "United Kingdom") %>%
  select(percent, proportion) ->
  Brexit

prediction <- sum(Brexit$percent * Brexit$proportion)/
  (sum(Brexit$proportion))
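
This is just a weighted mean, so the same number can equivalently be obtained with weighted.mean():

# Equivalent: weighted.mean() divides by the sum of the weights for us
prediction <- weighted.mean(Brexit$percent, Brexit$proportion)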

Plotting election turnout

Now for a plot to show all of this. I’ll try using transparency (alpha) to represent each country’s proportion of the total EU population. I think this is good enough for picking out the larger countries. I’ll use a dashed line to show the ‘prediction’ for 2019.

elections %>%
  filter(country != "EU") %>%
  ggplot(aes(x = year, y = percent, 
             alpha = proportion,
             group = country)) +
  geom_line() + 
  scale_x_continuous(breaks = 1979 + 5*(0:8)) +
  geom_line(data = filter(elections, country == "EU"),
            colour = "blue",
            lwd = 1.1,
            alpha = 1) +
  annotate("segment", x = 2014, xend = 2019, 
           y = 42.61, yend = prediction, 
           colour = "blue",
           linetype = "dashed")+
  annotate("text", x = c(1977.5, 1978), y = c(85.5, 32), 
           label = c("Italy", "UK"))+
  labs(alpha = "Proportion of EU\npopulation (2018)")

Calculating and plotting relative changes in turnout

One other thing I’d like to look at is the relative change in turnout: (turnout in election \(n+1\) - turnout in election \(n\)) / turnout in election \(n\). I don’t know if there’s a ‘tidy’ way to do this, but I can use the diff() function if I put the data in an ‘untidy’ (wide) format. I’ll use spread() to get the data into wide format, diff() to compute the changes, and then gather() to get the data back into long format for plotting.

elections %>% spread(year, percent) ->
  untidyElections

changes <- diff(t(as.matrix(untidyElections[, 4:11]))) /
  t(as.matrix(untidyElections[, 4:10]))

untidyElections %>%
  select(country, proportion) %>%
  bind_cols(as.data.frame(t(changes))) %>%
  gather(year, percentChange, -country, -proportion) %>%
  mutate(year = as.numeric(year)) ->
  turnoutChanges
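
As an aside, a possibly tidier alternative is to stay in long format and use group_by() with lag(). A sketch, assuming one row per country and year as in elections (the first election for each country gets an NA rather than being dropped):

# Sketch: relative change in turnout within each country, staying in long format
elections %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(percentChange = (percent - lag(percent)) / lag(percent)) %>%
  ungroup()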

I’ll just plot the changes for the larger countries, excluding the UK. To see which countries are largest:

turnoutChanges %>%
  filter(year == 2014) %>%
  arrange(desc(proportion)) %>%
  head(n = 8)
##          country proportion year percentChange
## 1             EU      1.000 2014  -0.008377938
## 2        Germany      0.162 2014   0.111624682
## 3         France      0.131 2014   0.044302240
## 4 United Kingdom      0.129 2014   0.025936599
## 5          Italy      0.118 2014  -0.120368947
## 6          Spain      0.091 2014  -0.023623802
## 7         Poland      0.074 2014  -0.028536486
## 8        Romania      0.038 2014   0.172388869

I’ll make a list of the countries (in order of population size)

turnoutChanges %>%
  filter(year == 2014) %>%
  arrange(desc(proportion)) %>%
  pull(country) ->
  countrylist 

and go as far as Spain in the plot, excluding the UK.

turnoutChanges %>%
  filter(country %in% countrylist[1:6], country != "United Kingdom") %>%
  ggplot(aes(x = year, y = percentChange,
             colour = country)) +
  geom_line() + 
  scale_x_continuous(breaks = 1984 + 5*(0:6)) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "relative change in turnout")

That’s enough for now. In a later post, I’ll try forecasting the 2019 turnout as an expert elicitation problem, and consider how these data will help.
