Probability And Randomness: The Problem With Averages

While averages might offer some insight into performance, they are often misleading. Today Dominic Yeo tells us to be cautious when working with averages.
Last time we discussed coincidences and how to calculate exactly how likely various unlikely outcomes might be. In this post, we turn our attention from analysis of rare events towards analysis of normal behaviour.
In general, when we are interested in two situations which occur repeatedly, for example, the performances of two teams over the course of a league season, we might have a large amount of data. When we compare two large sets of data, it is hard to say anything meaningful if we insist on taking into account every single piece of available information. We normally look for some summary of each of the data sets and compare these. It is particularly convenient if these summary statistics are numerical, as it is easy to compare different numbers! If nothing else, we can always say whether one number is greater than, less than or equal to another.
There will be lots of choices for how to do this. If we want to compare the batting performances of two cricketers, we could for example look at which had recorded fewer ducks, or which had scored more runs in their most recent innings at Headingley. While both of these give some sort of measure of quality, obviously neither is particularly useful – after all, it would look very odd if these properties were listed on an end-of-season table of averages.
Indeed, typically what we look for as a summary statistic is an ‘average’. The exact definition of average will depend on context, but typically it is given by the sum of all the data divided by the number of entries. So for the batsmen, this will be the total number of runs scored (within the season for example) divided by the number of innings. Strictly, in cricket one divides by the number of times the batsman was out, since there is always at least one not-out batsman at the end of a team’s innings, but that doesn’t affect anything too much. This makes sense as a method of comparison. After all, it is a good thing if a player has scored many more runs than a competitor, unless it has taken very many more matches to do so.
Caution: Misleading Averages Ahead
However, as with many statistical ideas, we have to be wary that sometimes our intuition can lead us astray. To illustrate some of the dangers of throwing around averages without enough care, consider the following example. I’ve made up this particular data to make the arithmetic nice and easy, but a bit later I’ll talk about why this sort of thing might genuinely emerge.
Consider the performances of two England batsmen in back-to-back series against India and Bangladesh, two teams near the top and bottom respectively of the world Test rankings. Suppose these are their averages:
Batsman  India  Bangladesh
Cook     30     55
Strauss  20     50
From the data it seems fairly clear that Cook should have a higher overall average than Strauss, since he has scored more runs in each series. But if we now complete the table with a bit more information, everything changes.
Batsman  India Innings  India Runs  India Average  Bangladesh Innings  Bangladesh Runs  Bangladesh Average
Cook     2              60          30             2                   110              55
Strauss  1              20          20             4                   200              50
Well, so far it doesn’t look as if much has changed, but when we calculate the overall average, we see that Cook has scored 170 runs in 4 innings, giving an average of 42.5, whereas Strauss has scored 220 runs in 5 innings, giving an average of 44.
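To check the arithmetic, here is a minimal Python sketch that recomputes each average from the innings and runs in the table. The dictionary layout and the function name overall_average are illustrative choices of mine, not part of any cricket statistics library.

```python
# (innings, runs) per series, copied from the made-up table above.
records = {
    "Cook":    {"India": (2, 60), "Bangladesh": (2, 110)},
    "Strauss": {"India": (1, 20), "Bangladesh": (4, 200)},
}

def overall_average(series):
    """Total runs divided by total innings, pooled across every series."""
    innings = sum(inns for inns, _ in series.values())
    runs = sum(r for _, r in series.values())
    return runs / innings

for player, series in records.items():
    # Per-series averages: Cook is ahead in both columns...
    per_series = {opp: runs / inns for opp, (inns, runs) in series.items()}
    # ...but the pooled average is computed from totals, not from these.
    print(player, per_series, overall_average(series))
```

The per-series figures favour Cook in both columns, while the pooled figure favours Strauss, because the pooled figure depends on how the innings are split between the two opponents.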
This phenomenon is called Simpson’s paradox, and it shows up in many disparate areas where it seems reasonable to combine averages.
Simpson's Paradox
As with many supposed paradoxes, it isn’t really that surprising at all. I used the phrase “since he has scored more runs in each series” a few paragraphs earlier. Of course, since at the time we only knew the averages, and not how many innings each player had batted, we could not know the total number of runs scored in each series. With the more complete table, we can explain what has happened: Strauss has played more often against weaker opposition so his average has benefited.
Note that an identical situation might be the following. Yesterday I drove to visit my parents: the route involved some motorway, where the speed limit is 70mph, and residential streets, where the limit is 30mph. I kept to the limit at all times, so my average speed was 50mph. Of course it is immediately obvious why this is nonsense. It isn’t as simple as that – it depends on how long I spent on each type of road. If the residential speed limit only affected the final 100 yards of a 200-mile journey, it would have very little impact on the overall average speed, which would be slightly less than 70mph.
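The driving example can be sketched in a few lines of Python. This assumes, as in the story, that I travel at exactly the limit on every stretch; the function name average_speed and the list-of-legs layout are just illustrative.

```python
def average_speed(legs):
    """Overall speed in mph for a list of (distance_miles, speed_mph) legs.

    Average speed is total distance over total time, so each leg is
    weighted by the time spent on it, not counted once per road type.
    """
    total_distance = sum(d for d, _ in legs)
    total_time = sum(d / s for d, s in legs)  # hours
    return total_distance / total_time

YARDS_PER_MILE = 1760
residential = 100 / YARDS_PER_MILE  # the final 100 yards, in miles

# 200-mile journey, all motorway except the last 100 yards.
long_trip = average_speed([(200 - residential, 70), (residential, 30)])

# Equal time on each road type (half an hour each): this is the only
# case where the naive 50mph answer is right.
equal_time = average_speed([(35, 70), (15, 30)])

# Equal distance on each road type gives something lower again.
equal_distance = average_speed([(100, 70), (100, 30)])

print(long_trip, equal_time, equal_distance)
```

The naive answer of 50mph only comes out when the same amount of time is spent on each type of road. With equal distances the slow stretch takes longer and drags the figure down to about 42mph.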
So what do we learn? The key to avoiding Simpson’s paradox is to remember that it is easy to manipulate totals, but not so obvious how to manipulate averages. In particular, in most circumstances, the overall average across two types of data is not the ‘average of the averages’. For this reason, averages are not always a good measure of ability. As in the cricket example given, it is important to know how and against whom the totals were obtained.
Another Context
In other contexts, this paradox has a different resolution. We can present exactly the same data in a medical situation, after multiplying the number of samples by 100. Suppose we are testing a new drug, and naturally we have to compare against a placebo as a control, where the patient is given a sugar pill or similar. Then we get:
Treatment  First Trial Patients  First Trial Successes  First Trial Success %  Second Trial Patients  Second Trial Successes  Second Trial Success %
Drug       200                   60                     30%                    200                    110                     55%
Placebo    100                   20                     20%                    400                    200                     50%
As before, since the drug gave a higher success rate than the placebo in both trials, we might say it was more promising, but its overall success rate is lower: 170 successes from 400 patients (42.5%) against 220 from 500 (44%) for the placebo. Note that the success rate exactly mirrors the role of the batting average in this scenario. The problem here is that if you are looking to draw scientific conclusions, it is highly unsatisfactory to do so from wildly contrasting data. Did they change the drug between the first and second trial? Why else would the success rate shoot up? Let alone the placebo, which really should be constant! This is not to say that you cannot draw conclusions from the data, but there are certainly lots of questions to be answered about how it was generated.
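One way to see where the reversal comes from is to look at how each group's patients are split between the two trials. The sketch below uses the made-up figures from the table; the names pooled_rate and second_trial_share are illustrative.

```python
# (patients, successes) for the first and second trial, from the table.
trials = {
    "Drug":    [(200, 60), (200, 110)],
    "Placebo": [(100, 20), (400, 200)],
}

def pooled_rate(arm):
    """Overall success rate: total successes over total patients."""
    patients = sum(n for n, _ in arm)
    successes = sum(s for _, s in arm)
    return successes / patients

def second_trial_share(arm):
    """Fraction of this group's patients who were in the second trial."""
    return arm[1][0] / sum(n for n, _ in arm)

for name, arm in trials.items():
    print(f"{name}: pooled {pooled_rate(arm):.1%}, "
          f"second-trial share {second_trial_share(arm):.0%}")
```

Four fifths of the placebo patients were in the second trial, where success rates were much higher for everyone, against only half of the drug patients. That imbalance is enough to reverse the comparison.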
Further Considerations
The final aspect worth addressing is whether this sort of phenomenon ever occurs in practice, and why. Variation between seasons can occur, and there is lots of analysis in baseball, possibly the most data-heavy sport anyway. In cricket, the best source of this sort of thing is comparing performances at home and away. Cricinfo has a list of players with notably large differences between home and away performances. There are several reasons why this disparity might arise: home support, knowledge of local conditions and so on. In particular, English players traditionally struggle with the spin found on subcontinental pitches, but can take advantage of their experience of the swinging ball at home. By contrast, batsmen from New Zealand, where the pitches are typically slow, green and low-scoring, may well boost their averages by taking a tour to friendlier conditions.
For Simpson’s paradox to become a possibility, we also need the number of matches played home and away to vary substantially. In the toy example given, Cook batted equally often against India and Bangladesh, whereas Strauss batted four times as often against Bangladesh as against India. In practice, one will not see such ratios over the course of a whole career, even with one-off events like World Cups, which count as away matches for the vast majority of players. However, these can arise via a ‘Law of Small Numbers’, the informal opposite of the probabilistic theorem the Law of Large Numbers. The idea is that we are more likely to see odd effects if we don’t have very much data. This probably isn’t much consolation to, for example, Robert Key, the Kent batsman whose England career came to a halt after a string of consistently low scores, despite a respectable average of 31, derived almost entirely from a huge double-century against the West Indies on a flat track at Lord’s. Who knows what would have happened if he’d played enough for the Law of Large Numbers to take over?
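The ‘Law of Small Numbers’ effect is easy to see in a quick simulation. The sketch below is only an illustration under assumptions of my own: innings scores are drawn independently from an exponential distribution with a mean of 35, which is a convenient stand-in and not a fitted model of real batting. It measures how much the career average varies between simulated batsmen of identical ability.

```python
import random
import statistics

def spread_of_averages(n_innings, n_careers=2000, mean_score=35.0, seed=0):
    """Standard deviation of the career average across simulated careers.

    Every simulated batsman has the same underlying ability (mean_score);
    the only thing that differs is luck. The exponential score
    distribution is a hypothetical choice made for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    averages = [
        statistics.fmean(
            rng.expovariate(1 / mean_score) for _ in range(n_innings)
        )
        for _ in range(n_careers)
    ]
    return statistics.pstdev(averages)

short_career = spread_of_averages(n_innings=5)
long_career = spread_of_averages(n_innings=200)
print(short_career, long_career)
```

With only a handful of innings, the averages of equally able batsmen are scattered widely, so one big score or a run of low ones can dominate. With a couple of hundred innings the averages bunch much more tightly around the true value, which is the Law of Large Numbers at work.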
Final Thoughts
Averages are typically based on known data. In principle, we would like to use averages as an estimate of likely future behaviour. In the next post, I will talk about errors and variance as a measure of how good an estimate an average might be for what happens in the future, and how odds are affected when different people have different views about these quantities.
Follow Dominic on Twitter: @DominicJYeo
Read more of Dominic's work on his blog EventuallyAlmostEverywhere.wordpress