Section 3 The basics

Using the mtcars dataset in R (type ?mtcars for details), suppose we want to show the relationship between fuel economy and car weight. We start with a minimal plot command, with no additional arguments:

attach(mtcars)
plot(wt, mpg)

Figure 3.1: A plot of fuel economy against car weight

Such a figure would be unacceptable in a report! As a bare minimum, you should always make sure that you

  1. include proper axis labels: never simply use the R variable names;

  2. specify the units;

  3. give sufficiently detailed captions so that the figure can be understood on its own, and include a conclusion: what do we learn from the figure?

The first two points are obvious; the third is perhaps less so, and we discuss it a little more below.

3.1 The caption test

You may be unsure as to whether you should include a particular plot or not. You may be tempted to ‘err on the safe side’, by including the plot, but if the plot doesn’t tell the reader anything useful, this will just lead to a bloated report (which we will penalise on MAS6005!). A simple test to apply is as follows.

State, in your caption, what conclusion the reader should draw from looking at the plot. If you can’t think of anything to say, this probably means that there’s nothing useful to be learned from your plot: leave it out of your report!

This doesn’t mean that a plot must show an interesting relationship between two variables. A plot may suggest that two variables are unrelated, and that can still be informative to a reader. There will also be exceptions: it may be helpful to include a plot simply to show what data are available, but it should be obvious when you need to apply the caption test.

3.2 Choice of plotting symbol

This is a matter of taste, but you may prefer filled circles to the empty circles that R uses by default. Empty circles can work better when a scatter plot contains lots of points, because overlapping points are easier to see. To change the plotting symbol, use the pch argument. Type ?points to see the options.
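As a quick visual companion to ?points, the following sketch draws the symbols numbered 0 to 25, each labelled with its pch value. The names pch.values, grid.x and grid.y are introduced here purely for illustration, and the grid layout is an arbitrary choice.

```r
# Draw plotting symbols 0 to 25 in a grid, each labelled with its pch number.
# (pch.values, grid.x and grid.y are hypothetical names used only in this sketch.)
pch.values <- 0:25
grid.x <- (pch.values %% 9) + 1    # column position, 1 to 9
grid.y <- 3 - (pch.values %/% 9)   # row position, 3 (top row) down to 1

plot(grid.x, grid.y, pch = pch.values, cex = 2,
     bg = "grey",                  # fill colour, only used by symbols 21 to 25
     xlim = c(0.5, 9.5), ylim = c(0.5, 3.5),
     axes = FALSE, xlab = "", ylab = "")

# Print each pch number just below its symbol.
text(grid.x, grid.y, labels = pch.values, pos = 1, offset = 1)
```

Symbols 15 to 20 are filled shapes, and symbols 21 to 25 take a separate fill colour through the bg argument.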

Putting this all together, an improvement on the previous attempt would be as follows (with the caption handled separately).

plot(wt, mpg, xlab = "Weight (lb/1000)",
     ylab = "Miles/(US) gallon",
     pch = 16)

Figure 3.2: Fuel economy and weight for 32 models of car (from 1973-1974). Heavier cars tend to be less fuel efficient.

3.3 Further customising scatter plots

You can use different symbols and colours to represent additional variables in scatter plots. You need to be careful, however, as

  • your reader may print your report in black and white;
  • your reader may have some form of colour blindness;
  • plots can look messy with large numbers of colours and symbols, particularly if the corresponding groups are not well separated.

(Also, if you’re preparing a plot for a talk, it’s possible the colours won’t project on the screen as you’d expect.)

Continuing with the mtcars dataset, suppose we want to display the number of cylinders for each model of car on our scatter plot. The variable cyl is numeric, so if we pass cyl directly as the vector of colours and symbols, R uses colour and symbol numbers 4 (blue cross), 6 (pink triangle) and 8 (grey star) in the plot:

plot(wt, mpg, xlab = "Weight (lb/1000)",
     ylab = "Miles/(US) gallon",
     col = cyl,
     pch = cyl)

Figure 3.3: Using the values of the cyl variable to determine the colour and symbol. The choice of colours and symbols looks a little odd here and may give the impression you haven’t really thought about it!

This may not be desirable (I think the choice of colours and symbols above looks odd). We can create new vectors to specify plot colours and symbols directly, which is a little more effort, but worth it in the end.

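# Start with the 4-cylinder choice for every car, then overwrite the other groups.
cyl.symbol <- rep(15, length(cyl)) # use a square for 4 cylinders
cyl.symbol[cyl == 6] <- 19 # use a (slightly larger) circle for 6 cylinders
cyl.symbol[cyl == 8] <- 17 # use a triangle for 8 cylinders

cyl.colour <- rep("black", length(cyl)) # black for 4 cylinders
cyl.colour[cyl == 6] <- "red" # red for 6 cylinders
cyl.colour[cyl == 8] <- "blue" # blue for 8 cylinders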

Now we can use these new vectors in the plot. We must also add a legend (and modify the caption).

plot(wt, mpg, xlab = "Weight (lb/1000)",
     ylab = "Miles/(US) gallon",
     col = cyl.colour,
     pch = cyl.symbol)

legend("topright", legend = c(4, 6, 8),
       col = c("black", "red", "blue"),
       pch = c(15, 19, 17),
       title = "no. of cylinders")

Figure 3.4: Fuel economy and weight for 32 models of car (from 1973-1974). Heavier cars, and cars with more cylinders, tend to be less fuel efficient.

(We have to be careful here to make sure the legend matches the data correctly. Legends are easier to manage in ggplot2.)
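One way to reduce the risk of a mismatch in base R is to define the symbol and colour for each group once, and to build both the plot vectors and the legend from that single definition. The following is a minimal sketch of that idea, not the approach used above: cyl.key and cyl.index are hypothetical names, and the columns are taken from mtcars directly so that the block runs without attach().

```r
# One row per group: the single place where symbols and colours are defined.
# (cyl.key and cyl.index are hypothetical names introduced for this sketch.)
cyl.key <- data.frame(cyl = c(4, 6, 8),
                      pch = c(15, 19, 17),
                      col = c("black", "red", "blue"),
                      stringsAsFactors = FALSE)

# For each car, find the row of cyl.key that matches its number of cylinders.
cyl.index <- match(mtcars$cyl, cyl.key$cyl)

plot(mtcars$wt, mtcars$mpg, xlab = "Weight (lb/1000)",
     ylab = "Miles/(US) gallon",
     col = cyl.key$col[cyl.index],
     pch = cyl.key$pch[cyl.index])

# The legend is built from the same table as the plotted points.
legend("topright", legend = cyl.key$cyl,
       col = cyl.key$col,
       pch = cyl.key$pch,
       title = "no. of cylinders")
```

Changing a symbol or colour then means editing one row of cyl.key, and the plot and legend change together.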

Plots with large numbers of different symbols and colours can look a little messy, particularly if the groups are not very well distinguished. In this case, it may be worth merging groups. For example, if there were particular interest in cars with 8 cylinders, we could treat cars with 4 and 6 cylinders as a single group.
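As an illustration of merging groups, the sketch below separates 8-cylinder cars from all the others. The name eight.cyl and the particular symbols and colours are assumptions made for this example; the columns are taken from mtcars directly so that the block runs on its own.

```r
# Two groups only: 8-cylinder cars versus all other cars.
# (eight.cyl is a hypothetical name introduced for this sketch.)
eight.cyl <- mtcars$cyl == 8

plot(mtcars$wt, mtcars$mpg, xlab = "Weight (lb/1000)",
     ylab = "Miles/(US) gallon",
     col = ifelse(eight.cyl, "blue", "black"),
     pch = ifelse(eight.cyl, 17, 15))

# Legend entries are listed in the same order as the two groups defined above.
legend("topright", legend = c("4 or 6", "8"),
       col = c("black", "blue"),
       pch = c(15, 17),
       title = "no. of cylinders")
```

With only two groups, the legend is shorter and the group of interest stands out more clearly.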