Eliciting a distribution for a cricket score

In this post I’ll illustrate the process of eliciting (my own) distribution following a SHELF approach. The true value of the uncertain quantity is now known; you’ll have to take my word for it that I haven’t cheated! But in any case, the purpose of this post is not to validate the method, or to test my own ability in making probability judgements: I want to see the extent to which I can justify my probability values, based on the evidence I have available to me.

The quantity of interest

I’m going to use a cricket example, but I’ll try to write this so that you can follow the elicitation without knowing anything about cricket.

It’s the end of the first day of a five-day match between Sri Lanka and England (November 6th, 2018). The England batsman Ben Foakes has scored 87 runs not out. My quantity of interest \(X\) is how many additional runs he will score in his first innings (so his total score will be \(87+X\)). I will treat \(X\) as continuous.

My evidence dossier

I first have to compile all the evidence on which I’ll base my judgements. There is a wealth of data available (which does make the elicitation problem somewhat easier than some others), but none of it is directly relevant, in that Foakes is playing his first international match; this situation has never occurred before. I will have to consider the relevance of each piece of evidence.

One could go a lot more in depth here, but I’ll restrict myself to the following. (If you’re not a cricket enthusiast, you can skip over this list, but one detail that’s helpful to know is that Foakes has to bat with a partner; it’s important how good his two remaining partners are.)

  1. The state of the match at the end of day 1. (See the day 1 summary at the bottom of the linked page)
  2. Foakes’ domestic cricket batting records (Average 40.6, eight hundreds, highest score: 141 not out)
  3. The batting records for the two other remaining batsmen: Leach (domestic) and Anderson (international). (Averages of 12.55 and 9.8 respectively)
  4. Previous 9th and 10th wicket partnerships for England in Sri Lanka. (Previous highest 25 and 41 respectively.)
  5. Scores above 100 by players on their test match debuts. (Record for the ‘modern era’ is Jacques Rudolph’s 222 not out. Most recent English batsman to do this was Matt Prior with 126 not out.)

‘Missing’ evidence

One can also think about evidence that might exist, but hasn’t been provided. For a real SHELF workshop, the organisers may try to obtain it, but simply making a note would at least flag a point for consideration. An example here would be

  6. I think there is an increased risk of a batsman getting out at the start of a new session/day; the bowlers will be rested, and the batsman will need to regain his concentration. I can recall instances of this happening, but I don’t have any data, and there’s a risk of an availability bias.

My plausible range

Choosing a lower plausible limit is easy: 0 is a real possibility here, so I’ll set \(L=0\). Choosing the upper plausible limit \(U\) is harder. I could simply pick some large value, say 1000; that would be an absurdly high value (although…) But I don’t think this is good practice. In general, one needs to give serious consideration to what the extremes could be, and if I just casually pick some high number, I haven’t done this. Instead, I might

  • search for the most comparable data I can find related to extreme values;
  • make a judgement about how relevant those data are to the situation I have here.

I’ve picked out item 5 in my evidence dossier as being most relevant. Now I need to consider:

  • how do the conditions relating to the observations in item 5 compare with those here? Are they more or less favourable towards large values of \(X\)?

I think the conditions here are considerably less favourable to large \(X\), based on the match situation (item 1), the two remaining batsmen (item 3), and the point I raised in item 6. I’m going to think in terms of multiples of 50, which I think in the context of cricket makes some sense: each multiple of 50 is acknowledged as a sort of milestone. Looking at item 4, although I think \(X>50\) is unlikely, I wouldn’t rule it out. I think \(X>100\) is too implausible, from what I’ve said about item 5 and the conditions here, so I’ll settle on \(U=100\) for my upper plausible limit.

A comment on uniform distributions

I can’t claim that there is a ‘right’ distribution given my evidence dossier, but I think one can argue that some distributions are ‘wrong’. What about a uniform distribution over \([0, 100]\)? This might seem like a conservative or ‘vaguely informative’ choice, but it’s important to think about just how much probability this distribution would give to larger values of \(X\). The evidence really doesn’t support all values in this range equally; I think it’s more supportive of values closer to 0 than to 100, so I think it would be hard to justify this choice of distribution, given the evidence available.
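To put numbers on this, here is a small base R check of what a uniform distribution on \([0, 100]\) would imply. It is only an illustration and was not part of the elicitation itself; the variable names are made up for this sketch.

```r
# Uniform distribution on [0, 100], for illustration only.
p_over_50 <- punif(50, min = 0, max = 100, lower.tail = FALSE)  # P(X > 50)
tertiles <- qunif(c(1 / 3, 2 / 3), min = 0, max = 100)          # implied tertiles

p_over_50
tertiles
```

The uniform choice gives an even chance to \(X>50\) and puts a third of the probability above about 67, which sits poorly with the previous partnership records in item 4.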

Eliciting tertiles

I like thinking in terms of tertiles: judging a 33rd and 66th percentile for \(X\), which I’ll denote by \(T_1\) and \(T_2\). I get three intervals \([L, T_1]\), \([T_1, T_2]\), and \([T_2, U]\), each of which I think has a ‘reasonable chance’ (1 in 3) of containing \(X\), but I would bet against a specific interval containing \(X\). I wouldn’t be claiming, for example, that we should ‘expect’ \(X\) to lie in \([T_1, T_2]\): I think it’s twice as likely to lie outside.
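Written out, the tertile judgements and the resulting odds against the middle interval are as follows (taking the tertile probabilities as exactly one third each):

```latex
\[
P(L \le X \le T_1) = P(T_1 \le X \le T_2) = P(T_2 \le X \le U) = \tfrac{1}{3},
\qquad
\frac{P\bigl(X \notin [T_1, T_2]\bigr)}{P\bigl(X \in [T_1, T_2]\bigr)} = \frac{2/3}{1/3} = 2 .
\]
```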

In judging \(T_1\), I think there are two separate plausible ‘mechanisms’ that would give low \(X\): either Foakes gets out (item 6), or his two partners get out quickly (item 3). My precise choice of \(T_1\) is a little arbitrary, but I think \(T_1=10\) is reasonable in describing a moderate (1 in 3) chance that Foakes adds little to his score.

I find it a little harder to decide where to place \(T_2\), but I will set it at 25. I think Foakes will need some support from at least one of the other batsmen to do this, and I think it’s possible one of them will stick around (I can still remember the earnest applause for Peter Such’s 50 ball duck…). But based on item 4, I still wouldn’t expect \(X\) to be very large.

I also need to choose my median value, which has to be between 10 and 25. I’ll set this slightly closer to 10, choosing a median of \(M=15\). One point of reference here is that \(X=13\) would take him to 100 overall; he and his partners will be highly motivated to get 100, and I think it more likely than not that he achieves it. It’s hard to be precise about this, though very precise placement of my median will probably not be important.
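As a consistency check, a median of 15 does agree with the judgement that reaching 100 is more likely than not, since \(X\) is treated as continuous and \(13 < M\):

```latex
\[
P(87 + X \ge 100) = P(X \ge 13) \;\ge\; P(X \ge 15) = P(X \ge M) = 0.5 .
\]
```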

Fitting and feedback

I’ll now fit distributions to these judgements, and look at the implied 5th and 95th percentiles (for three distributions only: a gamma, a log normal, and a log Student-\(t_3\)):

library(SHELF)
v <- c(10, 15, 25)
p <- c(0.33, 0.5, 0.66)
myfit <- fitdist(vals = v, probs = p, lower = 0, upper = 100)
feedback(myfit, quantiles = c(0.05, 0.95), values = 13)$fitted.quantiles[, 3:5]
##      gamma lognormal   logt
## 0.05  1.39      2.64   1.56
## 0.95 65.10     92.70 157.00

There’s nothing to choose between the fitted 5th percentiles, but the fitted 95th for the log normal and log Student-\(t\) are too high for me; the gamma seems more reasonable. So this would be my final elicited distribution:

plotfit(myfit, d = "gamma", ql = 0.05, qu = 0.95)

Figure 1: My chosen distribution for \(X\). The fitted 95th percentile is 65.

Note that feedback is essential here: there’s nothing to choose between the distributions regarding their fit to my tertiles and median; I have to see what they do in the tail.

Figure 2: Gamma, log normal, and log Student-\(t\) distributions fitted to my elicited tertiles and median. For the sorts of distributions we elicit and fit in SHELF, several distributions will fit equally well: we have to use feedback (as well as considering the context) to choose between them.
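For readers who want to see roughly what the fitting step involves, here is a minimal base R sketch. It assumes the parameters are chosen by least squares on the cumulative probabilities, which may differ in detail from what `fitdist` does internally, so the numbers need not match the table above exactly. The helper names `gamma_error` and `lnorm_error` are hypothetical, and only the gamma and log normal are shown.

```r
# Elicited values and their cumulative probabilities, as in the post.
v <- c(10, 15, 25)
p <- c(0.33, 0.5, 0.66)

# Sum of squared differences between fitted and elicited probabilities.
# Positive parameters are optimised on the log scale (an assumption of this sketch).
gamma_error <- function(par) {
  fitted <- pgamma(v, shape = exp(par[1]), rate = exp(par[2]))
  sum((fitted - p)^2)
}
lnorm_error <- function(par) {
  fitted <- plnorm(v, meanlog = par[1], sdlog = exp(par[2]))
  sum((fitted - p)^2)
}

# Starting values are rough guesses based on a median near 15.
gamma_opt <- optim(c(0, log(1 / 15)), gamma_error)
lnorm_opt <- optim(c(log(15), 0), lnorm_error)

# How closely each distribution reproduces the three judgements.
fit_error <- c(gamma = gamma_opt$value, lognormal = lnorm_opt$value)

# Implied 5th and 95th percentiles for each fitted distribution.
tails <- rbind(
  gamma = qgamma(c(0.05, 0.95),
                 shape = exp(gamma_opt$par[1]), rate = exp(gamma_opt$par[2])),
  lognormal = qlnorm(c(0.05, 0.95),
                     meanlog = lnorm_opt$par[1], sdlog = exp(lnorm_opt$par[2]))
)
colnames(tails) <- c("0.05", "0.95")

fit_error
tails
```

Comparing `fit_error` with `tails` illustrates the point about feedback: two distributions can match the tertiles and median about equally well while implying noticeably different 95th percentiles.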

And finally…

Foakes was out for 107 on day two, so the true value of \(X\) was 20. For this sort of problem, it’s probably not too hard to be conservative and avoid overconfidence; it’s perhaps harder to avoid underconfidence. I was fairly comfortable in using the evidence to choose \(T_1\), but I found it harder to think about how large \(X\) might reasonably be, and place \(T_2\).