Probability And Randomness: Measures Of Randomness
How can we measure variance and what can it tell us? Today on the blog Dominic Yeo continues his series on probability and randomness.
In the last post, we talked about averages, how to calculate them and how to use them. The main danger was in attempting to calculate the average of some data that wasn’t really generated by the same source. This was particularly noticeable when we compared the effect of a drug before and after improving the drug and altering the number and type of participants.
All of the objections raised in the previous post could be summarised by saying that the average does not tell you everything about the data you are considering. Today we proceed in the opposite direction by considering the standard deviation and variance of some data, which, informally, tell you how good the average is as an estimate.
A First Example
As a concrete example, consider the following sets of data, which we could interpret in lots of ways: stock price changes or rainfall over time; heights of children in a school class; and so on.
Set A: 10.1, 10.2, 19.7, 9.8, 20.0, 20.1, 9.9, 20.2
Set B: 14.3, 13.6, 16.0, 15.2, 15.8, 14.6, 16.2, 14.3
After recovering from the shock of a string of numbers, we might calculate the average of these data sets, and they’ve been chosen so that both averages are exactly 15. But is it likely that they came from the same source?
Well no, not least because they look completely different on the page. The numbers in the second set are all fairly close to 15, ranging from 13.6 to 16.2, and spread within this interval, not quite evenly, but there are certainly no noticeable gaps. By contrast, the data in the first set looks like it might have come from two sources. One source is producing numbers close to 10, and the other is producing numbers close to 20. It turns out we have four of each, and so end up with an average of 15. In particular, there are no entries between 10.2 and 19.7, leaving a rather clear gap around the actual average!
So the fact that the average is 15 is a more useful description of the second set of data than the first. But how should we measure this aspect?
What about this? We’ve said that the average is less useful for the first data set, because in fact all of the values are rather different to the average. So let’s work out the differences between each data point and the average.
For the first set we get:
4.9, 4.8, 4.7, 5.2, 5.0, 5.1, 5.1, 5.2.
Note that we have not identified whether the values are higher or lower than the average. Based on how we work out the average, and just by intuition, there should be values both higher and lower than the average. But we ignore this for now – we are only interested in how far a data point is from the average.
So if we take the average of this new string of numbers, the list of differences, we get 5.0. For the second set of data, we get 0.8. We could therefore use this quantity, the average distance from the mean, to distinguish quantitatively between the two cases.
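For readers who like to check the arithmetic by machine, here is a minimal Python sketch of the calculation just described. The function name mean_abs_deviation is a label chosen here for illustration only; the data are Set A and Set B from above.

```python
# Set A and Set B, exactly as listed earlier in the post.
set_a = [10.1, 10.2, 19.7, 9.8, 20.0, 20.1, 9.9, 20.2]
set_b = [14.3, 13.6, 16.0, 15.2, 15.8, 14.6, 16.2, 14.3]


def mean_abs_deviation(data):
    """Average distance of the data points from their mean (hypothetical helper)."""
    mean = sum(data) / len(data)
    # abs() discards whether a point is above or below the mean,
    # keeping only how far away it is.
    return sum(abs(x - mean) for x in data) / len(data)


print(round(mean_abs_deviation(set_a), 1))  # 5.0
print(round(mean_abs_deviation(set_b), 1))  # 0.8
```

Both sets have the same average, but the first is typically about six times further from it than the second.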
In fact, statisticians prefer to consider the squared difference between a data point and the average. The average squared difference between a data point and the average is called the variance. Remember that ‘squaring’ a number means multiplying it by itself. The reasons for this convention are not especially enlightening. Essentially, if you have two independent sources of randomness, it is easier to calculate the overall variance, because sums of squares work together rather nicely: the variance of the total is simply the sum of the two variances. This is somewhat reminiscent of everyone’s favourite school maths theorem, due to Pythagoras, which relates the squares of the side lengths of a right-angled triangle.
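As a rough sketch of this definition, the following Python snippet computes the variance of Set A and Set B by averaging the squared differences. The function name variance is chosen here for illustration. As a cross-check it is compared with pvariance from Python's standard statistics module, which also divides by the number of data points (the similarly named statistics.variance divides by one fewer, a different convention not used in this post).

```python
import statistics

# Set A and Set B, exactly as listed earlier in the post.
set_a = [10.1, 10.2, 19.7, 9.8, 20.0, 20.1, 9.9, 20.2]
set_b = [14.3, 13.6, 16.0, 15.2, 15.8, 14.6, 16.2, 14.3]


def variance(data):
    """Average squared difference between each data point and the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


print(round(variance(set_a), 4))  # 25.03
print(round(variance(set_b), 4))  # 0.7775

# The standard library's population variance uses the same definition,
# so the two results should agree up to floating-point rounding.
print(abs(variance(set_a) - statistics.pvariance(set_a)) < 1e-9)  # True
```

Squaring exaggerates the contrast: the two sets differ by a factor of about six in average distance from the mean, but by a factor of over thirty in variance.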
An Example: Comparing The Premier League And The Championship
As an example, let’s take last year’s final Premier League standings, and as a way to measure how well matched the division is, we look at the number of wins earned by each team, and calculate the variance of this quantity.
So we look down the Wins column, and can work out the total by doing a big sum:
28 + 23 + 22 + 21 + ... + 6 + 4 = 272.
Then, there are twenty teams, so we have to divide by 20, to get the average number of wins as 13.6. To calculate the variance, we have to work out the square of the difference between each data point and the average. Manchester United had 28 wins, so the difference between that and the average is:
28 – 13.6 = 14.4.
But now we also have to square this number before we add it to the rest of the collection, and this gives:
14.4 x 14.4 = 207.36.
If we proceed similarly, we get for example 88.36 for Manchester City, 0.16 for West Bromwich Albion, and 92.16 for QPR. Notice that teams who ended up a long way from the average are contributing a lot more to this sum than teams who ended up in the middle of the table. This is a common theme in probability and statistics: in most contexts, extreme values of the data are more interesting and significant than typical values. So we do the sum as follows, using x^2 to mean ‘x squared’.
14.4^2 + 9.4^2 + ... + 9.6^2 = 770.8.
But we want the average squared difference, so again we have to divide by 20, which is the number of teams, and we get 38.54 as the variance. Often, we consider instead the square root of the variance, which is called the standard deviation of the data. In this case the standard deviation is about 6.2 wins.
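The arithmetic in this example can be retraced with a short Python sketch. It uses only figures quoted above: the total of 272 wins, the 20 teams, and the sum of squared differences, 770.8. The full Wins column is not reproduced here, so the two sums themselves are taken on trust. The win counts for Manchester City, West Bromwich Albion and QPR are inferred from their quoted squared differences, and the variable names are chosen purely for illustration.

```python
import math

num_teams = 20
total_wins = 272           # total of the Wins column, as quoted above
sum_squared_diffs = 770.8  # sum of squared differences, as quoted above

mean_wins = total_wins / num_teams
print(round(mean_wins, 1))  # 13.6

# Win counts for the four teams mentioned in the text. Only Manchester
# United's 28 is stated outright; the other three are inferred from the
# quoted squared differences (88.36, 0.16 and 92.16).
wins = {
    "Manchester United": 28,
    "Manchester City": 23,
    "West Bromwich Albion": 14,
    "QPR": 4,
}
for team, w in wins.items():
    # Squared difference between this team's wins and the average.
    print(team, round((w - mean_wins) ** 2, 2))

variance = sum_squared_diffs / num_teams
std_dev = math.sqrt(variance)
print(round(variance, 2))  # 38.54
print(round(std_dev, 1))   # 6.2
```

Because the standard deviation is the square root of the variance, it is measured in the same units as the data, here wins, which makes it the easier of the two to interpret.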
However, if we are just using this statistic to make comparisons, it doesn’t really matter whether we are using the variance or the standard deviation. For example, if we apply the same procedure to the table for the Championship, we obtain a standard deviation of about 3.3 wins.
So this is exactly the sort of calculation we need to justify the comment that the teams in the Championship are better matched than those in the Premier League. Of course, number of wins is not the only measure of performance, nor necessarily the best, but to get an indication this variance calculation works perfectly well.
Next time, I will talk about applying these ideas directly in a betting context: how to interpret variance as a measure of risk, and how to decide what price to pay for extra risk.
Follow Dominic on Twitter: @DominicJYeo
Read more of Dominic's work on his blog EventuallyAlmostEverywhere.wordpress