Section 4 Comparing distributions
In this section, we’ll consider two ways of comparing distributions: box plots and histograms. Box plots are usually the best option for comparing multiple distributions in one plot, but histograms may be preferable if there are only two distributions to compare. When comparing histograms, we will probably need to do some editing to get a clear and informative figure.
We consider the mvscores dataset from the MAS6005 library. The aim is to compare test match batting scores for one player, Michael Vaughan, between the matches he played as captain and the matches where he was not captain.
4.1 Box plots
If we want to compare scores in the two periods, we can use a box plot:
library(MAS6005)
attach(mvscores)
boxplot(runs ~ captain,
        ylab = "runs scored",
        names = c("Matches not played as captain",
                  "Matches played as captain"))
In the mvscores data, we have two 2-level factors as independent variables (captain and innings), so we may wish to compare runs scored for the \(2\times 2\) groups that these define. Batting can be more difficult in a second innings, due to wear of the pitch. For those with no interest in/knowledge of cricket, the idea here is to think of captain as the main factor of interest, with innings as a blocking variable: we may be more interested in comparing captain/not captain scores within the same innings type than between different innings types.
If we specify a suitable formula in the boxplot command, we can display distributions for the four groups, but we should think about the order in which we specify the factors. We have two choices:
boxplot(runs ~ captain + innings)
boxplot(runs ~ innings + captain)
Arguably, the first choice is more useful: if we think of innings as a blocking variable, so that we want to make comparisons within blocks, the first option plots the appropriate distributions next to each other.
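To see why the order of the factors matters, it can help to inspect the group order that boxplot produces without drawing anything. The sketch below uses a small made-up data frame called scores (a hypothetical stand-in, not the real mvscores data), and the innings level names "first" and "second" are assumptions. The first factor in the formula varies fastest from left to right.

```r
# Hypothetical stand-in for mvscores: simulated runs, not real scores
set.seed(1)
scores <- data.frame(
  runs    = rgeom(40, prob = 0.03),
  captain = factor(rep(c("no", "yes"), times = 20)),
  innings = factor(rep(c("first", "second"), each = 20))
)

# plot = FALSE returns the box statistics without drawing the plot
by_innings <- boxplot(runs ~ captain + innings, data = scores, plot = FALSE)
by_captain <- boxplot(runs ~ innings + captain, data = scores, plot = FALSE)

# Left-to-right order of the boxes for each formula
by_innings$names  # "no.first" "yes.first" "no.second" "yes.second"
by_captain$names  # "first.no" "second.no" "first.yes" "second.yes"
```

With captain listed first, adjacent boxes share the same innings, which is the within-block comparison described above.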
Of course, we now have more tidying to do: we need proper labels for each box. We may need to use line breaks to fit all the labels in, which has the knock-on effect of needing more space between the tick marks and the axis labels (controlled by the second element of mgp). (Use \n inside an R string to produce a line break.)
par(mgp = c(3, 2, 0))
boxplot(runs ~ captain + innings,
        ylab = "runs scored",
        names = c("First innings, \nnot captain",
                  "First innings, \nas captain",
                  "Second innings, \nnot captain",
                  "Second innings, \nas captain"))
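A setting changed with par stays in force for later plots on the same device, so the wider mgp spacing would also apply to any figures drawn afterwards. One possible way to limit the change to a single plot is to keep the value that par returns and restore it afterwards. This is a minimal sketch: runs_demo and group_demo are made-up illustrative variables, not part of mvscores.

```r
# par() returns the previous values of the settings it changes
old_par <- par(mgp = c(3, 2, 0))

# Hypothetical two-group data, used only to have something to draw
set.seed(3)
runs_demo  <- rgeom(30, prob = 0.03)
group_demo <- factor(rep(c("a", "b"), each = 15))

boxplot(runs_demo ~ group_demo,
        ylab = "runs scored",
        names = c("Group a,\nsecond line", "Group b,\nsecond line"))

# Restore the earlier spacing so that later plots are unaffected
par(old_par)
```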
If we ignore innings, so that we just have two groups, we could instead plot two histograms to compare the distributions. We’ll use the mfrow option to draw two plots together, but you could handle this differently within an R Markdown/knitr document.
par(mfrow = c(1, 2))
hist(runs[captain=="no"],
     xlab = "Runs scored, not playing as captain",
     main = "")
hist(runs[captain=="yes"],
     xlab = "Runs scored, playing as captain",
     main = "")
Note that we’ve suppressed the title with the argument main = "". R histograms will include titles by default, but these are usually worth removing, as a proper axis label and caption should be sufficient.
Two things stand out as needing improvement:
- the two histograms aren’t drawn on the same scale, with the same axis limits;
- the bar widths are different; this looks untidy and makes comparison of the two histograms more difficult.
Although perhaps not obvious from the plot, the number of observations in the two periods isn’t the same (61 and 86), so it would also be better to scale the total area shown in each histogram to 1.
When plotting a histogram of sample data, we might think about what shape we expect the population distribution to have, and then choose a bar width to give the best impression of this shape. Here, we actually have population data rather than sample data, but in the context of cricket scores, a bin width of 10 would be a reasonable choice.
par(mfrow = c(1, 2))
hist(runs[captain=="no"],
     xlab = "Runs scored, not playing as captain",
     main = "",
     breaks = 10 * 0:20,
     ylim = c(0, .03),
     prob = TRUE)
hist(runs[captain=="yes"],
     xlab = "Runs scored, playing as captain",
     main = "",
     breaks = 10 * 0:20,
     ylim = c(0, .03),
     prob = TRUE)
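The effect of the shared breaks and the density scaling can be checked numerically. The sketch below uses simulated scores in two groups of sizes 61 and 86 (hypothetical values, not the real mvscores data; runs_a and runs_b are illustrative names) and computes the histograms without drawing them. On the density scale, bar height times bar width sums to 1 in each group, whatever the group size.

```r
set.seed(42)
# Hypothetical scores, capped at 199 so that they lie inside the breaks
runs_a <- pmin(rgeom(61, prob = 0.03), 199)
runs_b <- pmin(rgeom(86, prob = 0.03), 199)

common_breaks <- 10 * 0:20   # shared bins of width 10, covering 0 to 200

# plot = FALSE returns the bin counts and densities without drawing
h_a <- hist(runs_a, breaks = common_breaks, plot = FALSE)
h_b <- hist(runs_b, breaks = common_breaks, plot = FALSE)

# The total counts differ between the groups ...
sum(h_a$counts)
sum(h_b$counts)

# ... but the total area on the density scale is 1 for both
area_a <- sum(h_a$density * diff(h_a$breaks))
area_b <- sum(h_b$density * diff(h_b$breaks))

# One way to choose a shared ylim is from the tallest bar in either group
y_max <- max(h_a$density, h_b$density)
```

Note that hist stops with an error if any observation falls outside the supplied breaks, so the breaks need to cover the full range of both groups.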
Another possibility is to place one histogram underneath the other one. This can make it a little easier to see a shift in location, although you may need to make the figure quite large, otherwise the two histograms will look cramped:
par(mfrow = c(2, 1))
hist(runs[captain=="no"],
     xlab = "Runs scored, not playing as captain",
     main = "",
     breaks = 10 * 0:20,
     ylim = c(0, .03),
     prob = TRUE)
hist(runs[captain=="yes"],
     xlab = "Runs scored, playing as captain",
     main = "",
     breaks = 10 * 0:20,
     ylim = c(0, .03),
     prob = TRUE)
This can be handled differently in ggplot2 using ‘facets’. ggplot2 also makes it easy to overlay histograms (and density plots), but this doesn’t always give a good effect.
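As a rough illustration of the facet approach, the following sketch stacks two density-scaled histograms with a shared x-axis. It assumes the ggplot2 package is installed (version 3.3.0 or later, for after_stat), and it uses a simulated data frame called scores as a hypothetical stand-in for mvscores. The panel labels are illustrative choices.

```r
library(ggplot2)

set.seed(7)
# Hypothetical data: simulated runs, not the real mvscores values
scores <- data.frame(
  runs    = pmin(rgeom(147, prob = 0.03), 199),
  captain = factor(rep(c("no", "yes"), times = c(61, 86)))
)

p <- ggplot(scores, aes(x = runs)) +
  # after_stat(density) scales the bars in each panel to total area 1
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 10, boundary = 0) +
  # One panel per level of captain, stacked in a single column
  facet_wrap(~ captain, ncol = 1,
             labeller = labeller(captain = c(no  = "Not playing as captain",
                                             yes = "Playing as captain"))) +
  labs(x = "Runs scored", y = "Density")

print(p)
```

By default the facets share the same axis limits and the same bins, so the two issues noted for the base R histograms do not need to be fixed by hand.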