Section 4 Comparing distributions

In this section, we’ll consider two ways of comparing distributions: box plots and histograms. Box plots are usually the best option for comparing multiple distributions in one plot, but histograms may be preferable if there are only two distributions to compare. When comparing histograms, we’ll probably need to do some editing to get a clear and informative figure.

We consider the mvscores dataset from the MAS6005 library. The aim is to compare test match batting scores for one player, Michael Vaughan, between the matches he played as captain, and the matches where he was not captain.

4.1 Box plots

If we want to compare scores in the two periods, we can use a box plot:

library(MAS6005)
attach(mvscores)
boxplot(runs ~ captain,
        ylab = "runs scored",
        names = c("Matches not played as captain", 
                  "Matches played as captain"))

Figure 4.1: Comparing two distributions with a box plot. You may need to explain to your reader how to interpret a box plot!
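
One way to explain a box plot to a reader is to show the numbers behind it. The sketch below uses a small made-up vector of scores (not the mvscores data) and the base R function boxplot.stats() to extract the five values that are drawn, together with any observations that would be plotted individually. With the default settings, the whiskers extend to the most extreme observations lying within 1.5 times the length of the box from the box.

```r
# Made-up scores for illustration only; these are not from mvscores
toy_scores <- c(0, 4, 9, 12, 18, 23, 31, 40, 52, 150)

s <- boxplot.stats(toy_scores)

# s$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
s$stats

# s$out: observations beyond the whiskers, drawn as individual points
s$out
```

Here the box runs from 9 to 40 with the median at 20.5, the whiskers reach 0 and 52, and the score of 150 is shown as a separate point.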

In the mvscores data, we have two 2-level factors as independent variables (captain and innings), so we may wish to compare runs scored for the \(2\times 2\) groups that these define. Batting can be more difficult in a second innings, due to wear of the pitch. For those with no interest in or knowledge of cricket, the idea here is to think of captain as the main factor of interest, with innings as a blocking variable: we may be more interested in comparing captain/not captain scores within the same innings type than between different innings types.

If we specify a suitable formula in the boxplot command, we can display distributions for the four groups, but we should think about the order in which we specify the factors. We have two choices:

boxplot(runs ~ captain + innings)

Figure 4.2: (a): Here we’ve specified the factors in the order captain + innings, so innings is fixed at first in the first two distributions

boxplot(runs ~ innings + captain)

Figure 4.2: (b): Here we’ve specified the factors in the order innings + captain, so captain is fixed at no in the first two distributions

Arguably, the first choice is more useful. If we think of innings as a blocking variable, we want to make comparisons within blocks, and the first option plots the distributions to be compared next to each other.
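
To check the ordering without reading it off the plot, we can ask boxplot for the group names it would use. The sketch below assumes a small invented data frame, toy, with the same column names as mvscores; only the ordering of the groups matters here, not the values.

```r
# Hypothetical toy data with the same column names as mvscores;
# the values are invented purely to show the ordering of the groups
toy <- data.frame(
  runs    = c(10, 25, 40, 5, 60, 33, 18, 72),
  captain = factor(rep(c("no", "yes"), times = 4)),
  innings = factor(rep(c("first", "second"), each = 4))
)

# plot = FALSE returns the box plot statistics without drawing anything;
# the names component lists the groups in left-to-right plotting order
order_a <- boxplot(runs ~ captain + innings, data = toy, plot = FALSE)$names
order_b <- boxplot(runs ~ innings + captain, data = toy, plot = FALSE)$names

order_a
order_b
```

The factor written first in the formula varies fastest, so the second factor is held fixed within each adjacent pair of boxes.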

Of course, we now have more tidying to do: we need proper labels for each box. We may need to use line breaks to fit all the labels in, which has the knock-on effect of needing more space between the tick marks and the axis labels; the mgp graphical parameter controls this spacing. (Use \n inside an R string to produce a line break.)

par(mgp = c(3, 2, 0))
boxplot(runs ~ captain + innings,
        ylab = "runs scored",
        names = c("First innings, \nnot captain",
                  "First innings, \nas captain",
                  "Second innings, \nnot captain",
                  "Second innings, \nas captain"))

Figure 4.3: Tidying up the box plot, to give each distribution a suitable label.
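
Settings changed with par stay in force for later plots on the same graphics device. If that is not wanted, one common pattern is to store the previous values and restore them afterwards. This is a minimal sketch; old_par is just an illustrative variable name.

```r
# par() returns the previous values of any settings it changes,
# so they can be stored and put back later
old_par <- par(mgp = c(3, 2, 0))

# ... draw the box plot with the two-line labels here ...

par(old_par)  # restore the earlier mgp setting
```

Within a knitr document, graphical parameters are by default not carried over from one code chunk to the next, so this matters mainly when working interactively or in a plain R script.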

4.2 Histograms

If we ignore innings, so that we just have two groups, we could instead plot two histograms to compare the distributions. We’ll use the mfrow option to draw two plots together, but you could handle this differently within an R Markdown/knitr document.

par(mfrow = c(1, 2))
hist(runs[captain=="no"], 
     xlab = "Runs scored, not playing as captain", 
     main = "")
hist(runs[captain=="yes"], 
     xlab = "Runs scored, playing as captain", 
     main = "")

Figure 4.4: A first attempt to compare two distributions using histograms

Note that we’ve suppressed the title with the argument main = "". R histograms will include titles by default, but these are usually worth removing, as a proper axis label and caption should be sufficient.

Two things stand out as needing improvement:

  • the two histograms aren’t drawn on the same scale, with the same axis limits;
  • the bar widths are different; this looks untidy and makes comparison of the two histograms more difficult.

Although it is perhaps not obvious from the plot, the numbers of observations in the two periods are not the same (61 and 86), so it would also be better to scale the total area shown in each histogram to 1.
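
To see what scaling the area to 1 means, we can compute histograms without plotting them and inspect the result. The sketch below uses two made-up samples of different sizes (not the mvscores data); the variable names are illustrative only.

```r
# Made-up samples of unequal size (not the mvscores data)
toy_a <- c(3, 8, 14, 21, 27, 35, 48)
toy_b <- c(1, 5, 9, 12, 16, 22, 24, 31, 37, 44, 58)

breaks <- seq(0, 60, by = 10)

# plot = FALSE returns the histogram object without drawing it
h_a <- hist(toy_a, breaks = breaks, plot = FALSE)
h_b <- hist(toy_b, breaks = breaks, plot = FALSE)

# Counts sum to the sample sizes, so bar heights on the count scale
# are not directly comparable between the two groups
sum(h_a$counts)
sum(h_b$counts)

# Density times bin width gives each bar's area; the areas sum to 1
sum(h_a$density * diff(h_a$breaks))
sum(h_b$density * diff(h_b$breaks))
```

On the density scale, a taller bar means a larger proportion of that group's observations, regardless of how many observations the group has.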

When plotting a histogram of sample data, we might think about what shape we expect the population distribution to have, and then choose a bar width to give the best impression of this shape. Here, we actually have population data rather than sample data, but in the context of cricket scores, bin widths of size 10 would be a reasonable choice.

par(mfrow = c(1, 2))
# Common breaks (bins of width 10) and y-axis limits; freq = FALSE plots densities
hist(runs[captain == "no"],
     xlab = "Runs scored, not playing as captain",
     main = "",
     breaks = 10 * 0:20, ylim = c(0, .03), freq = FALSE)
hist(runs[captain == "yes"],
     xlab = "Runs scored, playing as captain",
     main = "",
     breaks = 10 * 0:20, ylim = c(0, .03), freq = FALSE)
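
The breaks and ylim values above were typed in by hand. As an alternative, they can be derived from the data, so that both panels are guaranteed to share them. This is only a sketch: toy_runs and toy_captain are made-up stand-ins for runs and captain, and shared_breaks, shared_ylim and h_list are illustrative names.

```r
# Made-up data standing in for runs and captain (not the mvscores values)
toy_runs    <- c(0, 7, 13, 22, 41, 58, 105, 4, 19, 33, 36, 64, 87, 140, 156)
toy_captain <- rep(c("no", "yes"), times = c(7, 8))

# Bins of width 10, starting at 0 and extending just past the largest score
shared_breaks <- seq(0, 10 * (floor(max(toy_runs) / 10) + 1), by = 10)

# Compute both histograms without plotting, to find the tallest bar
groups <- split(toy_runs, toy_captain)
h_list <- lapply(groups, hist, breaks = shared_breaks, plot = FALSE)
shared_ylim <- c(0, max(sapply(h_list, function(h) max(h$density))))

# Draw the stored histograms on the density scale with common limits
par(mfrow = c(1, 2))
for (g in names(h_list)) {
  plot(h_list[[g]], freq = FALSE, ylim = shared_ylim,
       xlab = paste("Runs scored, captain =", g), main = "")
}
```

Because both histograms use the same breaks, the x-axis limits also agree without setting xlim explicitly.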

Another possibility is to place one histogram underneath the other one. This can make it a little easier to see a shift in location, although you may need to make the figure quite large, otherwise the two histograms will look cramped:

par(mfrow = c(2, 1))
hist(runs[captain == "no"],
     xlab = "Runs scored, not playing as captain",
     main = "",
     breaks = 10 * 0:20, ylim = c(0, .03), freq = FALSE)
hist(runs[captain == "yes"],
     xlab = "Runs scored, playing as captain",
     main = "",
     breaks = 10 * 0:20, ylim = c(0, .03), freq = FALSE)

This can be handled differently in ggplot2 using ‘facets’. ggplot2 also makes it easy to overlay histograms (and density plots), but this doesn’t always give a good effect.
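
As a rough sketch of the facet approach, the code below stacks one histogram per level of captain, using the same bins in each panel. It assumes the ggplot2 package is installed, and it uses a made-up data frame, toy, in place of mvscores; the fill and colour choices are arbitrary.

```r
library(ggplot2)

# Made-up data frame standing in for mvscores (values are invented)
toy <- data.frame(
  runs    = c(0, 7, 13, 22, 41, 58, 105, 4, 19, 33, 36, 64, 87, 140, 156),
  captain = rep(c("no", "yes"), times = c(7, 8))
)

p <- ggplot(toy, aes(x = runs, y = after_stat(density))) +
  # Same bins in every panel: width 10, with bin edges at multiples of 10
  geom_histogram(binwidth = 10, boundary = 0,
                 fill = "grey80", colour = "black") +
  # One panel per level of captain, stacked vertically on a shared x-axis
  facet_wrap(~ captain, ncol = 1, labeller = label_both) +
  labs(x = "Runs scored", y = "Density")

print(p)
```

By default the panels share both axes, so the common scale that had to be set by hand in base graphics comes for free here.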