A Simplified Football Game Prediction Model


How can we predict the outcomes of football matches? What data should we use and how should we use it? Today on the blog Ford Bohrmann takes us through the early development of a football prediction model and how we can use these techniques to develop a more advanced model and improve our betting.


As a disclaimer, I'd like to point out that I do not come from a betting background. I have never placed a bet on a football game, mostly because it is illegal to do so in the United States. That being said, I am very interested in using statistics to better understand football, something that is related to football betting.

The Hypothesis

The simple question that I wanted to answer was this: How accurate can a model that predicts football game outcomes be?

Even more interesting, is this model more accurate than the odds set by bookmakers? I am sure that people have tried this before but I wanted to try it out on my own with my own model.

My hypothesis was that yes, the bookmakers could be beaten, even using a simple model.

The Variables

Obviously, there are nearly infinite variables that one could put into a model to predict football games. For this reason, I aimed to keep the model as simple as possible.

Specifically, I limited my model to just 4 variables:

- The home team's goal differential up to that point in the season.

- The away team's goal differential up to that point in the season.

- The home team's points from the previous season.

- The away team's points from the previous season.

Nothing too complicated or confusing at all. In my mind, these were 4 variables that were simple enough to use and gain access to easily, while also being important enough to create an accurate model.
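As a rough illustration of the first two variables, here is a minimal Python sketch of how a running goal differential could be attached to each match. The data layout (a chronologically ordered list of dicts with the keys home, away, home_goals and away_goals) and the function name are assumptions made for this example, not the format I actually worked with.

```python
from collections import defaultdict


def add_goal_differentials(matches):
    """Attach each team's pre-match goal differential to every match.

    `matches` is assumed to be a chronologically ordered list of dicts
    for ONE season, with hypothetical keys: home, away, home_goals,
    away_goals.
    """
    goal_diff = defaultdict(int)  # team -> running goal differential
    rows = []
    for m in matches:
        # Record the differentials BEFORE the match is played, so the
        # feature never includes the result we are trying to predict.
        rows.append({**m,
                     "home_gd": goal_diff[m["home"]],
                     "away_gd": goal_diff[m["away"]]})
        margin = m["home_goals"] - m["away_goals"]
        goal_diff[m["home"]] += margin
        goal_diff[m["away"]] -= margin
    return rows


# Made-up results for three fictional clubs, purely for illustration.
season = [
    {"home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"home": "B", "away": "C", "home_goals": 1, "away_goals": 1},
    {"home": "C", "away": "A", "home_goals": 0, "away_goals": 3},
]
features = add_goal_differentials(season)
```

The key design choice is that the differential is recorded before the match result is added, so each row only uses information that was available at kick-off.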

The Data

In order to do this, I first needed data. I got all of the data from the RSSSF Historical Domestic Results page. If you haven't heard of this website before, bookmark it. It's a great resource covering a wide range of leagues and years.

Anyways, this data was not in the exact format that I needed. After a lot of manipulation in Excel and then R, I was able to get it in the format that I needed. This actually took up the majority of the time and effort that the project required.

In order to show the predictive power of the model, I used the 2003/2004 to 2010/2011 seasons (8 seasons in total) to "train" the model. In other words, the model learned from the 8 previous seasons. This is the training set.

I used this past season, 2011/2012, as the "test" set. This is a true test of the predictive power of the model because the model had not yet seen the 2011/2012 data. The accuracy is the predictive accuracy of the model on new data.

The Model

I had a number of options for the actual statistical technique used for the model. There are two main ways people have predicted game outcomes in the past.

First, people have simply predicted the probability of each game outcome directly. Second, people have predicted the goals scored by each club in the game and then used that to get a percentage representing the odds of each outcome. Ultimately, I chose to go with the first option because it is a bit easier and simpler to implement.

After choosing this, there were also a number of techniques to choose from to get outcome predictions. The most common, and the one that came to mind first, was a probit model. Essentially, this is a regression in which the dependent variable can take only two values. In our case though, we need the model to predict three outcomes (win, draw and loss). There is a variant of the probit model called a multinomial probit model, which allows the dependent variable to take more than two values.

Ultimately, I decided to go with a different approach that was much simpler. Instead, I used a machine learning technique called Random Forest, which would classify, based on the input data, the outcome of each game. I used this model to predict the probability of each outcome (win, draw and loss) occurring. To be honest, I don't understand the full power and statistical reasoning behind the random forest model, but I do understand its basic idea. For a great explanation, see the Quora topic that explains it in layman's terms.
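For readers who want to see the general shape of this approach, here is a minimal sketch using Python and scikit-learn. This is a swapped-in toolchain for illustration (my own work was done in R), and the data below is randomly generated, so the fitted model is meaningless. The column order, the outcome labels and the number of trees are all assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)


def synthetic_matches(n):
    """Generate n fake matches. Columns (assumed order):
    home goal diff, away goal diff, home prev. points, away prev. points."""
    X = np.column_stack([
        rng.integers(-30, 31, n),
        rng.integers(-30, 31, n),
        rng.integers(30, 91, n),
        rng.integers(30, 91, n),
    ])
    # Random labels: stand-ins for real results, not real data.
    y = rng.choice(["home", "draw", "away"], size=n)
    return X, y


X_train, y_train = synthetic_matches(8 * 380)  # 8 training seasons
X_test, _ = synthetic_matches(380)             # 1 held-out season

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# One row per match, one column per outcome; the column order follows
# model.classes_, which scikit-learn sorts alphabetically.
proba = model.predict_proba(X_test)
```

The point of asking for probabilities rather than a single predicted class is that each row of `proba` can then be compared directly against normalised betting odds.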

The Benchmark

I used a couple of benchmarks to test the validity of my model. First, I assigned a random likelihood to each outcome of every game. To understand this, imagine your drunk friend spitting out random odds for each outcome throughout the season. Swansea is at home against Liverpool, and your friend tells you there's a 15% chance Swansea wins, a 58% chance of a draw, and a 27% chance of a Liverpool win. Imagine your friend doing this random process for all 380 games of the season - I want the model to at least be more accurate than your drunk friend.
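One way the drunk friend could be simulated is sketched below in Python. The exact random scheme (three uniform random numbers rescaled to sum to 1) and the fixed seed are assumptions for this example; other ways of drawing random probabilities would give somewhat different benchmark scores.

```python
import random


def random_forecast(rng):
    """Return random [home, draw, away] probabilities that sum to 1."""
    raw = [rng.random() for _ in range(3)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
forecasts = [random_forecast(rng) for _ in range(380)]  # one per game
```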

The second (and more accurate) benchmark I used was the actual betting odds. When you convert odds to percentage likelihoods, the sum is greater than 100% (this is how the betting companies make money), so I normalised these numbers so that they added up to 100%.
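The normalisation step can be sketched in a few lines of Python. This assumes the odds are quoted in decimal format, where the implied probability is 1 divided by the odds; the example odds are made up and the function name is hypothetical.

```python
def normalised_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1."""
    implied = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    total = sum(implied)  # exceeds 1 by the bookmaker's margin
    return [p / total for p in implied]


# Made-up decimal odds for a single match: home win, draw, away win.
overround = 1 / 2.10 + 1 / 3.40 + 1 / 3.60
probs = normalised_probabilities(2.10, 3.40, 3.60)
```

Dividing each implied probability by the total spreads the bookmaker's margin proportionally across the three outcomes, which is the simplest of several possible ways to remove it.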

Using these numbers, I tested the accuracy of my model against the accuracy of using the betting numbers. If my model is more accurate than the betting odds, we are on to something.

Quantifying the accuracy of the model is a somewhat tricky thing to do. I did this by taking the geometric mean of the odds assigned to the outcome that actually occurred. For example, if the model said that there was a 50% chance of the home team winning and the home team ended up winning, we would take the value of .5 from that game. If you do this for every game, and take the geometric mean of all the odds assigned to each outcome that actually occurred, you get a pseudo measure of accuracy of the model. A higher number implies a higher accuracy. If you were somehow able to choose correctly all 380 games in the season, you would have an accuracy of 1. If you chose all of them wrong, you would have an accuracy of 0. Because the model assigns a probability between 0 and 1 to each outcome, we are going to fall somewhere in between.
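To make the measure concrete, here is a small Python sketch of the calculation. The dict-based layout and the "H", "D", "A" labels for home win, draw and away win are assumptions for this example.

```python
import math


def geometric_mean_accuracy(predictions, outcomes):
    """Geometric mean of the probabilities given to what actually happened.

    predictions: list of dicts mapping outcome label -> probability
    outcomes:    list of the outcome labels that actually occurred
    """
    probs = [pred[actual] for pred, actual in zip(predictions, outcomes)]
    if any(p == 0 for p in probs):
        return 0.0  # a single "impossible" result drags the score to 0
    # Averaging the logs avoids multiplying hundreds of small numbers.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


# Two made-up games in which the home team won both times.
preds = [{"H": 0.5, "D": 0.3, "A": 0.2},
         {"H": 0.5, "D": 0.25, "A": 0.25}]
score = geometric_mean_accuracy(preds, ["H", "H"])

# Sanity check: always saying 1/3 for each outcome scores 1/3.
uniform = [{"H": 1 / 3, "D": 1 / 3, "A": 1 / 3}] * 380
uniform_score = geometric_mean_accuracy(uniform, ["H"] * 380)
```

As a reference point when reading the scores, a forecaster who assigns exactly one third to every outcome of every game scores one third under this measure, whatever the results turn out to be.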

The Results

OK, so how did the simplified model actually perform compared to the benchmarks? Pretty well, actually. Turns out your drunk friend coming up with random guesses at each outcome did not do very well. He scored an accuracy measure of .25 for the 2011/2012 season.

What about the odds makers? Specifically, I looked at the normalised odds for Bet365. They did considerably better, scoring an accuracy of .34 for the 2011/2012 season.

Finally, the results of the simplified model described above. My Random Forest model scored an accuracy of .33 for the 2011/2012 season. Yes, this is below the accuracy when using the betting odds approach. However, for a model of just 4 variables it's not too bad in my mind.


What are the next steps? I'd like to include some more variables in the model to make it more accurate. Some possibilities I had in mind were the transfer spending of the home and away team or some more detailed statistics like passing or shooting metrics. If you have any suggestions for more variables I'd love to hear them in the comments section.

Overall, my takeaway from this is that betting odds are not very accurate for predicting football outcomes, considering a fairly primitive model is almost as accurate. The opportunity to beat the odds is definitely present, although it might take some work to make it accurate enough to actually make money in the long run.



Ford is the editor of the blog Soccer Statistically. He also contributes to EPL Index.

You can also follow him on Twitter: @SoccerStatistic