A Simplified Football Game Prediction Model
Blog at Soccer Statistically. Contribute stats articles for EPL Index. Economics student. Play soccer at Haverford College.
How can we predict the outcomes of football matches? What data should we use and how should we use it? Today on the blog Ford Bohrmann takes us through the early development of a football prediction model and how we can use these techniques to develop a more advanced model and improve our betting.
As a disclaimer, I'd like to point out that I do not come from a betting background. I have never placed a bet on a football game, mostly because it is illegal to do so in the United States. That being said, I am very interested in using statistics to better understand football, something that is related to football betting.
The simple question that I wanted to answer was this: How accurate can a model to predict football game outcomes be?
Even more interesting, is this model more accurate than the odds set by bookmakers? I am sure that people have tried this before but I wanted to try it out on my own with my own model.
My hypothesis was that yes, the bookmakers could be beaten, even using a simple model.
Obviously, there are nearly infinite variables one could put into a model to predict football games. For this reason, I aimed to keep the model as simple as possible.
Specifically, I limited my model to just 4 variables:
- The home team's goal differential up to that point in the season.
- The away team's goal differential up to that point in the season.
- The home team's points from the previous season.
- The away team's points from the previous season.
Nothing too complicated or confusing at all. In my mind, these were 4 variables that were simple to use and easy to get access to, while also being important enough to create an accurate model.
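As a sketch of how these four inputs might be assembled (the original work was done in Excel and R; this Python snippet, its result format, and its team names are illustrative assumptions, not the author's actual code):

```python
# Sketch: building the four model inputs from raw results.
# The result format (home, away, home_goals, away_goals) is an assumption.

def goal_diff_to_date(results, team):
    """Goal differential for `team` over the results played so far this season."""
    diff = 0
    for home, away, hg, ag in results:
        if home == team:
            diff += hg - ag
        elif away == team:
            diff += ag - hg
    return diff

def match_features(results_so_far, prev_points, home, away):
    """The four inputs used by the model for one fixture."""
    return [
        goal_diff_to_date(results_so_far, home),   # home team's goal diff so far
        goal_diff_to_date(results_so_far, away),   # away team's goal diff so far
        prev_points[home],                         # home team's points last season
        prev_points[away],                         # away team's points last season
    ]

results = [("Swansea", "Liverpool", 1, 0), ("Arsenal", "Swansea", 2, 2)]
prev = {"Swansea": 47, "Liverpool": 52}
print(match_features(results, prev, "Swansea", "Liverpool"))  # [1, -1, 47, 52]
```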
In order to do this, I first needed data. I got all of the data from the RSSSF Historical Domestic Results page. If you haven't heard of this website before, bookmark it. It's a great resource covering a wide range of leagues and years.
Anyway, this data was not in the exact format that I needed. After a lot of manipulation in Excel and then R, I was able to get it into the format I needed. This actually took up the majority of the time and effort that the project required.
In order to show the predictive power of the model, I used the 2003/2004 to 2010/2011 seasons (8 seasons in total) to "train" the model. In other words, the model learned from the 8 previous seasons. This is the training set.
I used this past season, 2011/2012, as the "test" set. This is a true test of the predictive power of the model because the model had not yet seen the 2011/2012 data. The accuracy is the predictive accuracy of the model on new data.
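A season-based split like this can be sketched in a few lines (the original data wrangling was done in Excel and R; the row format below is a made-up illustration). The key point is that there is no shuffling: the test season must be entirely unseen.

```python
# Sketch: split by season so the 2011/2012 test set is never seen in training.
# The row format is an assumption for illustration.

rows = [
    {"season": "2003/2004", "features": [3, -2, 60, 45], "outcome": "H"},
    {"season": "2010/2011", "features": [1, 4, 52, 71], "outcome": "A"},
    {"season": "2011/2012", "features": [0, 0, 47, 58], "outcome": "D"},
]

TEST_SEASON = "2011/2012"
train = [r for r in rows if r["season"] != TEST_SEASON]
test = [r for r in rows if r["season"] == TEST_SEASON]
print(len(train), len(test))  # 2 1
```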
I had a number of options for the actual statistical technique used for the model. There are two main ways people have predicted game outcomes in the past.
First, people have simply predicted the game outcome as a percentage. Second, people have predicted the goals scored for each club in the game and then used that to get a percentage representing the odds of each outcome. Ultimately, I chose to go with the first option because it is simpler to implement.
After choosing this, there were also a number of techniques to choose from to get outcome predictions. The most common, and the one that came to mind first, was a probit model. Essentially, this is a regression where the dependent variable can only take two values. In our case, though, we need the model to predict three outcomes (win, draw and loss). There is a variant called the multinomial probit model which allows the dependent variable to take more than two values.
Ultimately, I decided to go with a different, much simpler approach. Instead, I used a machine learning technique called Random Forest, which classifies the outcome of each game based on the input data. I used this model to predict the probability of each outcome (win, draw and loss) occurring. To be honest, I don't understand the full power and statistical reasoning behind the random forest model, but I do understand its basic idea. For a great explanation, read the Quora topic explaining it in layman's terms.
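The original model was fit in R; a minimal equivalent in Python with scikit-learn, using made-up training rows purely for illustration, might look like this:

```python
# Illustrative scikit-learn equivalent of the R Random Forest model.
# Training rows and the tree count are made up, not the author's values.
from sklearn.ensemble import RandomForestClassifier

# Features: [home goal diff, away goal diff, home prev points, away prev points]
X_train = [
    [5, -3, 68, 40],
    [-2, 7, 44, 71],
    [1, 1, 55, 52],
    [0, -4, 47, 39],
]
y_train = ["H", "A", "D", "H"]  # full-time result: home win / away win / draw

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# predict_proba gives one probability per outcome, ordered by model.classes_
probs = model.predict_proba([[2, -1, 47, 52]])[0]
print(dict(zip(model.classes_, probs.round(2))))
```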
I used a couple of benchmarks to test the validity of my model. First, I assigned a random likelihood to each outcome of every game. To understand this, imagine your drunk friend spitting out random odds for each outcome throughout the season. Swansea is at home against Liverpool, and your friend tells you there's a 15% chance Swansea wins, a 58% chance of a draw, and a 27% chance of a Liverpool win. Imagine your friend doing this random process for all 380 games of the season - I want the model to at least be more accurate than your drunk friend.
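The drunk-friend baseline amounts to drawing a random probability triple that sums to 1 for every game. One way to do that (my choice for this sketch; the post doesn't say how the random odds were generated) is a Dirichlet draw:

```python
# Sketch: random (home, draw, away) probabilities for all 380 games.
# The Dirichlet distribution is an assumed choice for generating triples
# that are each non-negative and sum to 1.
import numpy as np

rng = np.random.default_rng(42)

n_games = 380
random_odds = rng.dirichlet([1, 1, 1], size=n_games)

print(random_odds.shape)           # (380, 3)
print(random_odds[0])              # one random probability triple
print(random_odds.sum(axis=1)[:3])  # each row sums to 1
```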
The second (and more accurate) benchmark I used was the actual betting odds. When you convert betting odds to percentage likelihoods, the sum is greater than 100% (this margin, the overround, is how the betting companies make money), so I normalised these numbers to add up to 100%.
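The normalisation step is straightforward: invert each decimal odd to get an implied probability, then divide by the total so the three outcomes sum to 1. The odds below are made up for illustration:

```python
# Sketch: stripping the bookmaker's margin (overround) from decimal odds.
# These odds are made-up example values, not real Bet365 prices.

decimal_odds = {"home": 2.50, "draw": 3.40, "away": 2.90}

implied = {k: 1 / v for k, v in decimal_odds.items()}
overround = sum(implied.values())          # > 1: the bookmaker's margin
normalised = {k: p / overround for k, p in implied.items()}

print(round(overround, 3))                 # 1.039
print({k: round(p, 3) for k, p in normalised.items()})
```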
Using these numbers, I tested the accuracy of my model against the accuracy of using the betting numbers. If my model is more accurate than the betting odds, we are on to something.
Quantifying the accuracy of the model is somewhat tricky. I did this by taking the geometric mean of the probabilities assigned to the outcome that actually occurred. For example, if the model said that there was a 50% chance of the home team winning and the home team ended up winning, we would take the value of .5 from that game. Do this for every game, take the geometric mean of all those values, and you get a pseudo measure of the accuracy of the model. A higher number implies a higher accuracy. If you were somehow able to assign a probability of 1 to the correct outcome in all 380 games of the season, you would have an accuracy of 1. If you assigned a probability of 0 to an outcome that occurred, you would have an accuracy of 0. Because the model assigns a probability between 0 and 1 to each outcome, we are going to fall somewhere in between.
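The measure described above is easy to compute: collect the probability assigned to each game's actual outcome, then take the geometric mean. A sketch with made-up probabilities for three games:

```python
# Sketch of the accuracy measure: geometric mean of the probabilities
# assigned to the outcomes that actually happened (values are made up).
import math

assigned = [0.50, 0.34, 0.21]  # one probability per game

# Geometric mean, computed in log space for numerical stability
accuracy = math.exp(sum(math.log(p) for p in assigned) / len(assigned))
print(round(accuracy, 3))  # 0.329
```

Working in log space avoids underflow: multiplying 380 probabilities directly would produce a number too small for floating point to represent reliably.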
OK, so how did the simplified model perform compared to the benchmarks? Pretty well, actually. It turns out your drunk friend coming up with random guesses at each outcome did not do very well. He scored an accuracy measure of .25 for the 2011/2012 season.
What about the odds makers? Specifically, I looked at the normalised odds for Bet365. They were even better, scoring an accuracy of .34 for the 2011/2012 season.
Finally, the results of the simplified model described above. My Random Forest model scored an accuracy of .33 for the 2011/2012 season. Yes, this is below the accuracy when using the betting odds approach. However, for a model of just 4 variables it's not too bad in my mind.
What are the next steps? I'd like to include some more variables in the model to make it more accurate. Some possibilities I had in mind were the transfer spending of the home and away team or some more detailed statistics like passing or shooting metrics. If you have any suggestions for more variables I'd love to hear them in the comments section.
Overall, my takeaway from this is that betting odds are not very accurate for predicting football outcomes, considering a fairly primitive model is almost as accurate. The opportunity to beat the odds is definitely present, although it might take some work to make it accurate enough to actually make money in the long run.
Ford is the editor of the blog Soccer Statistically. He also contributes to EPL Index.
You can also follow him on Twitter: @SoccerStatistic
I just read an academic paper on predicting football and cricket results. One of the findings was that three significant variables for both sports were: team quality, home advantage and current form. How you define/calculate these three variables is the tricky but not impossible bit. The authors achieved up to 15% return on investment. Mark
From your post: "...if the model said that there was a 50% chance of the home team winning and the home team ended up winning, we would take the value of .5 from that game."
Also: "If you chose all of them wrong, you would have an accuracy of 0. Because the model assigns a probability between 0 and 1 to each outcome, we are going to fall somewhere in between."
So a 0 (getting them all wrong) is possible, but 1 is not, as no match/bet would ever be priced at 100% for any one outcome, e.g. HOME 100%, DRAW 0%, AWAY 0%.
So what is a good number?
Those numbers, your .33 and the bookie's .34, are kind of interesting. I have a system to find errors, and my figure for these bets is .43, which gets me about 15-20% ROI on average across all my bets. I am still looking for the link between the two, but it seems any system around (.33, .34) breaks even, so this must be the tipping point where you can gain the edge over the bookie.
Also, in an even book (2-1, 2-1, 2-1) the answer has to be .33... so .33 must be the tipping point by the law of averages. Interesting that the bookie shows this exactly... I can not explain the drunk friend :) .25 could have been a very unlucky day. I'm sure a .4 or .5 is EXACTLY what can happen to the same drunk friend now and then.
I've been there... lol.
Price dictates bets/tips. BUT... form/stats dictate price. If we could just get a clearer picture of the bookie's errors and jump on them. This is exactly what a tip entails.
A good 12X tipster is not good at predicting matches; he is good at finding pricing errors...
I'm 100% searching for the holy grail.
From my experience, relying too much on past results between teams in football does not yield much. Form is what you should be after. Try sticking to fixture histories and you will realise how erroneous the pattern output becomes. This is not to say that it bears no weight, but it certainly carries much less than many think.
Interesting article. It will be interesting to see if you make any improvements. I have recently gone down the route of an ordinal logistic model, so bringing out a probability for the draw. My variables are similar to yours, although I have built in a quasi-subjective factor, which is a rating set at the start of the season based on last season's results and other factors, i.e. Chelsea are obviously better than their 6th place last year. See also Man City in earlier years.
I like your method of benchmarking. I will have a look at that.
Please, I would like to see the source code in R...
I'm very interested in it... I don't know how to set the parameters and implement the Random Forest technique.
I would be grateful for a response.
@rafrochen: Personally, I am not sure about that. I have built similar systems in the past, for baseball though, not soccer. But it's a sport where teams meet very often within a season, and relative to the betting odds the results were less than thrilling.
Very interesting article. I find that previous results between the teams are a very important variable when developing these types of methods, I think it could definitely get you some leverage.