The Core Of Predicting Football Match Outcomes: Ordered Logistic Regression
This is the third post in our series by Zach Slaton, explaining how to use simple-but-effective statistical concepts that can help provide a richer understanding of the data already at your fingertips. The first post in the series dealt with how linear regression prediction intervals can yield deeper insights, while the second post explained how to use exponential regression to quantify rare events like goal scoring totals. Today on the blog Zach explains how another type of regression theory – ordered logistic regression – can be used to explain the impact of many factors on the outcome of soccer events.
Perhaps one of the biggest adjustments a North American sports viewer must make when watching soccer is the concept of a tie score at the end of regulation not leading to extra time to allow for a tiebreaker. Baseball never ends in a tie unless it’s the commissioner calling an end to a fruitless All-Star game, while basketball follows a similar format of playing consecutive overtime periods until a winner is decided. The National Hockey League technically records a tie in the two teams’ playing records when normal time ends with such an outcome, but the league does utilise an extra period and a shootout to determine who won the match in overtime. The NFL plays a single overtime period before officially recording the match as a tie, but the occurrence of such an event is so rare that even some of the better players in the game don't realise such an outcome is possible. The concept of a tie is so anathema in American sports that North America’s Major League Soccer tried to Americanise the beautiful game for the first eight years of its existence (1996 through 2003) by insisting on using overtime play and penalty shootouts in regular season games in an attempt to always allow for a match winner. Thankfully, that aberration amongst others (a backwards counting clock?!?!) went away and the North American league is now aligned with the global norms for determining match results.
Punters who place soccer bets via the 1X2 method know they must take three possible match outcomes into account – a win for team 1, a tie (or draw, depending on where you’re from), or a win for team 2. The bettingexpert.com community has explained how bettors or casual observers can convert the 1X2 odds into a per cent chance of each outcome, but how can the lay statistician create a model that returns the likelihood of match outcomes?
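As a rough illustration of the odds conversion mentioned above, here is a minimal Python sketch. It assumes the odds are quoted in decimal format and removes the bookmaker's margin by simple proportional scaling, which is only one of several possible approaches. The function name and the odds used are made up for illustration.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal 1X2 odds into outcome probabilities that sum to 1."""
    # The reciprocal of a decimal price is the raw implied probability.
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    # The raw values add up to more than 1; the excess is the bookmaker's margin.
    total = sum(raw)
    # Assumption: spread the margin proportionally across the three outcomes.
    return [p / total for p in raw]


# Hypothetical odds, not taken from any real match.
home, draw, away = implied_probabilities(2.10, 3.40, 3.60)
print(f"1: {home:.1%}  X: {draw:.1%}  2: {away:.1%}")
```

The shortest price always maps to the largest probability, and the three values sum to 100% once the margin has been scaled out.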
Attempting to do so via linear regression wouldn’t be very fruitful. There would be a good bit of clustering of data points and what would appear to be a whole lot of noise with many possible predictors (x-axis) and only three possible outcomes (y-axis). The answer lies in an advanced regression method called ordered logistic regression. The term sounds intimidating, but it is an extremely powerful tool that can be well understood if worked methodically through a few examples.
The Difference Between Linear and Ordered Logistic Regression
Linear regression, a topic with which many people are familiar, provides a very useful comparator to ordered logistic regression. The most striking difference comes in the visualisation of the regression relationship. As the name implies, ordered logistic regression uses a logistic relationship to build the regression equation that translates to an S-curve shape. This means there is a portion of the regression that behaves in a quasi-linear manner like linear regression, but as one works towards the extremes of the predictive data (x-axis) the S-curve will flatten out (the mathematical term is “go asymptotic”) signalling diminishing returns of the dependent variable (y-axis).
The analogy to be made at this point is that if one looks at the score of a soccer match in the 60th minute under two scenarios – a three goal lead versus a five goal lead – the difference in the likelihood of winning the match in the two scenarios is minimal (i.e. the saturated portion of the curve). However, the difference in the likelihood of winning the match under two different scenarios at the same point in time – a one goal lead and a two goal lead – is significant (i.e. the linear portion of the curve). The graph below from Wikipedia provides an example for just such a curve, where a neutral differential (x-axis = 0) indicates a 50% likelihood of either outcome happening. Going further to the left on the axis would indicate a negative goal differential, while going to the right would be a positive goal differential.
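The diminishing returns described above can be checked numerically with the standard logistic function. The sketch below treats x as a goal differential and assumes a slope of one unit per goal. That slope is an illustrative choice, not a value fitted to match data, so only the shape of the curve should be read from it.

```python
import math


def logistic(x):
    """Standard logistic (S-curve) function, returning a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumption: one unit on the x-axis per goal of differential (illustrative only).
for lead in (0, 1, 2, 3, 5):
    print(f"goal differential {lead}: {logistic(lead):.3f}")

# Moving from a one goal lead to a two goal lead changes the curve far more
# than moving from a three goal lead to a five goal lead.
gain_small_lead = logistic(2) - logistic(1)
gain_large_lead = logistic(5) - logistic(3)
print(f"1 -> 2 goals: +{gain_small_lead:.3f}, 3 -> 5 goals: +{gain_large_lead:.3f}")
```

A differential of zero returns exactly 0.5, matching the 50% point on the curve.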
Another key difference in the two types of regression models is the need to classify the dependent variable data prior to running an ordered logistic regression. As the name implies, an ordered logistic regression requires the dependent variable data be classified as an order of outcomes – (L)oss, (T)ie, (W)in being one possible order that relies upon the alphabetical order of the words. Such data could also be classified per the points earned by outcome – 0, 1, 3 – as those values are easily expressed in order from lowest to highest.
Whatever the data being analysed, an ordered logistic regression will analyse the independent data’s impact on the likelihood of the dependent data showing up in one of the ordered, pre-defined discrete dependent variable values – L, T, W or 0, 1, 3. The use of discrete outcomes (0,1,3) rather than continuous (any value between 0 and 3) is another distinction between linear and ordered logistic regression. Thus, ordered logistic regression lends itself well to analysing the probability of discrete outcomes like the three available in a soccer match.
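To make the mechanics concrete, here is a small sketch of how an ordered logistic model turns a single predictor into loss, tie and win probabilities. It uses the common proportional-odds form, in which the cumulative probability of an outcome at or below a given level is a logistic function of a cutpoint minus the slope times the predictor. The slope and the two cutpoints below are invented for illustration and are not fitted to any data.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def outcome_probabilities(x, beta, cut_loss_tie, cut_tie_win):
    """Return P(loss), P(tie), P(win) under a proportional-odds ordered logit.

    Assumes cut_loss_tie < cut_tie_win so every probability is positive.
    """
    eta = beta * x
    p_loss = logistic(cut_loss_tie - eta)         # P(outcome is a loss)
    p_loss_or_tie = logistic(cut_tie_win - eta)   # P(outcome is a loss or a tie)
    return {
        "L": p_loss,
        "T": p_loss_or_tie - p_loss,
        "W": 1.0 - p_loss_or_tie,
    }


# Hypothetical parameters: a slope of 0.8 and cutpoints at -0.6 and 0.6.
for x in (-1.0, 0.0, 1.0):
    probs = outcome_probabilities(x, beta=0.8, cut_loss_tie=-0.6, cut_tie_win=0.6)
    print(x, {k: round(v, 3) for k, v in probs.items()})
```

The three probabilities always sum to one, and with a positive slope a larger predictor value shifts likelihood away from a loss and towards a win.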
Uses of Ordered Logistic Regression
Moving from the abstract to the concrete, here are a few examples where ordered logistic regression has been used to analyse soccer.
Transfer Price Index
The Transfer Price Index uses a variety of models to quantify how player valuations, in terms of transfer fees and team wage bills, affect expected match and season-long outcomes in the English Premier League. The index has used linear regression to perform an analysis of the long-term impact squad valuations have on a team's finishing position, similar to the seminal Soccermetrics analysis.
The index has also created an ordered logistic model to evaluate the impact squad and starting XI valuations have on individual results. This model highlights the difference in near-term results, which are far more random and less determined by spending than longer-term results. This model has been used to evaluate manager performance versus financial expectations at the match level as well as create an alternative list of Premier League champions that takes performance versus spending into account.
Projecting Points Needed for CONCACAF World Cup Qualification
The United States Men’s National Team has the seventh longest active streak of World Cup final appearances within FIFA, but a generational change within the team and the rise of other nations within CONCACAF will make this year’s qualification tougher than normal. CONCACAF uses a four-round method for determining who will represent it in Brazil in 2014, with the final round being a six-team round robin of home-and-away matches. The top three teams automatically advance to Brazil, while the fourth-place team contests a two-match playoff against Oceania's winner for one of the few remaining spots in Brazil. The point total required to finish in the top four or the top three has varied by CONCACAF tournament – fourth place has ranged between 12 and 16 points while third has ranged between 14 and 17. How can one express levels of confidence in projected finish via point totals throughout the fourth round? An ordered logistic regression of finish position by point total using data from the last four tournaments demonstrates the changing likelihoods of not qualifying, earning a playoff position, and automatic qualification for the 2014 World Cup final in Brazil.
Those were just two examples of how ordered logistic regression analysis can help advanced modellers and bettors translate independent variables into predicted outcomes. As ordered logistic regression results are expressed in terms of percentage of likelihood of the specific outcome, they’re a “two-for-one” when it comes to probabilistic thinking. There is no need to calculate the main regression equation and then prediction intervals as required in linear regression.
Unfortunately, the advanced mathematical calculations behind ordered logistic regression mean that it cannot be found within Microsoft Excel, and an advanced statistics package must be used to perform such analyses. The free statistical programming language R contains such functionality, while perhaps easier-to-use commercial packages like Minitab and SPSS also contain logistic regression analysis tools. All have easy-to-follow tutorials either within the software or online, and the user can export the equations from these analyses into a tool like Excel that makes further calculations and large data set analysis easier. Consider taking the plunge on an advanced stats package if you’ve taken basic regression analysis as far as you can within Excel’s limited offerings. Once it is understood, ordered logistic regression is about as easy to perform as linear regression and provides far more powerful insights.
Follow Zach Slaton on Twitter: @the_number_game