Predicting the 2019 Rugby World Cup winner
By: Oli Plaistowe and the Solutions Team at World Programming, UK
The 2019 Rugby World Cup will determine which nation takes home the Webb Ellis Cup. During major sporting events, people from all backgrounds come together to discuss, support and of course predict how their home team will fare. We thought it would be fun to challenge the cognitive prowess of our data science team to build a model that would accurately predict who will win. Not only did we want to predict the overall winner, we went as far as predicting the outcome of each game too. This task was even more daunting as our data scientists had absolutely no prior knowledge of rugby!
We asked ourselves the question: ‘Can data provide better predictions than an expert in the field of rugby?’ We had the data and the brains; all we needed to do was enlist the help of an expert. We turned to someone who knows firsthand what it's like to lift the Webb Ellis Cup above his head: ex-England international Simon Shaw MBE, the Lock in the 2003 England World Cup winning squad. We were confident that we had found our domain expert!
Simon Shaw MBE
- 71 Caps for England
- 3 British and Irish Lions Tours
  - 17 appearances
  - 2 Tests
- First player to reach 200 Premiership appearances
- First Lock to ever kick a successful drop goal!
Whether you are an ex-professional with years of rugby experience like Simon, or you just take part in your office sweepstake, we wanted to create an easy workflow example to help get you started on your prediction journey. Although this was a bit of fun, we also wanted to show how analytics problems in sport are very similar to the projects encountered daily in the commercial sector. The adoption of analytics with 'machine learning' is rapidly increasing. However, data itself cannot be leveraged unless a human can define the problem and interpret the insights to provide context for decision making. Using a traditional approach to predictive modelling, we created a model without domain knowledge, then engaged with Simon, our expert, to optimize and improve it.
Defining the Problem
Win one match, win the World Cup? We defined the problem as estimating the likelihood of each participating team winning each match it contested at the World Cup, with the team holding the higher propensity score moving on to the next round, until a winner was identified in the final. As if the challenge wasn't hard enough, we restricted ourselves to just four days to complete the model.
Our data scientists were tasked with:
- defining the DV (dependent variable), in this case Win = 1 and Loss = 0
- data capture
- data preparation into the mining view
- model build
- model assessment and validation
- model refinement
As with any analytical delivery, we had two aims:
- create a powerful predictive model, and
- be able to explain the drivers in the model.
We found a scorecard to be the most intuitive way to explain the predictive drivers of each game. However, the results required normalization to produce a win percentage.
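The normalization step is not described in detail here. As a minimal sketch, assuming each side of a match receives its own non-negative propensity score, the two scores could be rescaled so that they sum to 100%. The function name `win_percentages` and the example scores are hypothetical.

```python
def win_percentages(score_a: float, score_b: float) -> tuple[float, float]:
    """Rescale two non-negative propensity scores so they sum to 100%."""
    total = score_a + score_b
    if score_a < 0 or score_b < 0 or total == 0:
        raise ValueError("scores must be non-negative and not both zero")
    pct_a = 100.0 * score_a / total
    return pct_a, 100.0 - pct_a


# Hypothetical propensities for the two sides of one match.
home_pct, away_pct = win_percentages(0.72, 0.48)
print(f"{home_pct:.1f}% vs {away_pct:.1f}%")
```

Scorecard points can be negative, so in practice they would need shifting or converting back to probabilities before a rescaling like this makes sense.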
When searching for data points on a subject in which you have little to no experience, it is important to validate the source for accuracy and reliability. In a domain such as sport, endless options of secondary data are published across a wide range of sources, from journals through to fan sites, but ultimately we focused on publicly available statistics as well as collecting all relevant historical weather readings.
However, due to the time constraints, it wasn't feasible to link weather conditions to the individual matches. Instead, an average temperature was reviewed to see if conditions had an even impact on national sides. With more time we would have liked to work with 'sporting data' companies to get further stats, which could provide more granular and fit-for-purpose data points.
Raw Data Points
|General Statistics||Player Statistics||World Cup Statistics|
|Number of matches||Number of matches||Year of Match|
|Year of Match||Year of Match||Number of matches|
|Head to head stats||Number of Yellow Cards||Number of Red Cards|
Determining the mining view is a key part of every data science project. Given our data came from several data sources, it was useful to display the data preparation in a workflow. The source data was available in varying forms, so we decided to scrape data by year and country. Another element of planning was deciding what we were going to predict and how we would partition the data for testing and validation.
We chose to predict the outcome of each match; although there might be additional benefits to a World Cup specific model, the World Cup happens every four years and there are not enough data points to train a suitably predictive model. Leveraging the SAS language, a mining view was created so that insights could be extracted.
|Mining view component||RWC scorecard|
|Unit of analysis||Match level|
|Sample size||1,750 matches, 2 observations per match for a 50%-50% win-loss ratio. 3,500 observations in modelling view|
|Performance window||All games prior to 2019 World Cup from 2004|
|Observation window||Historical match information over the period of fourteen years|
|Independent variables||Mixture of nominal, ordinal and interval data, such as aggregated values, flags, ratios, time and date values|
|Dependent variable||Win status (1 or 0). Draw matches removed to maintain a binary model|
|Data sources||Match data, player data, team data, environment data|
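To illustrate the "2 observations per match" layout, here is a minimal Python sketch (not the SAS language code used in the project). It assumes each raw match record holds two teams and their points; the `Match` type, the field names and the results are made up for illustration.

```python
from typing import NamedTuple


class Match(NamedTuple):
    year: int
    team_a: str
    team_b: str
    points_a: int
    points_b: int


def to_mining_view(matches):
    """Return one row per team per match with win = 1 or 0; draws are skipped."""
    rows = []
    for m in matches:
        if m.points_a == m.points_b:
            continue  # draws are removed to keep the target binary
        a_won = int(m.points_a > m.points_b)
        rows.append({"year": m.year, "team": m.team_a,
                     "opponent": m.team_b, "win": a_won})
        rows.append({"year": m.year, "team": m.team_b,
                     "opponent": m.team_a, "win": 1 - a_won})
    return rows


# Made-up results, for illustration only.
sample = [
    Match(2018, "Team A", "Team B", 24, 17),
    Match(2018, "Team C", "Team A", 12, 12),  # draw, dropped
]
view = to_mining_view(sample)
print(len(view), sum(row["win"] for row in view))
```

Because every retained match contributes exactly one win row and one loss row, the 50%-50% win-loss ratio follows by construction.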
Initially, the mining view consisted of more than 700 variables derived as the result of data preparation. Using various techniques such as clustering, significance testing and correlation analysis, we removed variables that were closely related and represented similar trends. We were left with the 40 most influential predictors, which were then fine-tuned to reveal the optimal combination.
Perhaps the most obvious insight, and a useful validation of the data, was that the higher the average number of games won in the previous year, the higher the probability of winning the next match.
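As a rough illustration of the correlation analysis step only, the sketch below greedily drops any variable that is highly correlated with one already kept. The 0.9 threshold, the column names and the values are assumptions for the example, not figures from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate variables.
columns = {
    "wins_prev_year": [8, 5, 3, 9, 2],
    "win_ratio_prev_year": [0.8, 0.5, 0.3, 0.9, 0.2],  # same signal as wins
    "yellow_cards": [2, 1, 3, 1, 4],
}
selected = drop_correlated(columns)
print(selected)
```

The result depends on column order, since the first variable of a correlated pair is the one retained; ordering candidates by univariate strength first is one way to make that choice deliberate.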
More interestingly, we found that winning the final five games before the tournament increases the likelihood of winning the World Cup – scientific proof of the "winning streak".
Win ratio of previous 5 matches
Result of second to last match
The second to last game is a better predictor than the one immediately before the tournament starts.
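Both form variables can be derived from a team's chronological list of results. The sketch below assumes results are coded 1 for a win and 0 for a loss and that only matches before the one being scored are passed in; `recent_form` is a hypothetical helper name.

```python
def recent_form(results, window=5):
    """Return (win ratio over the last `window` matches, second to last result).

    `results` holds 1/0 outcomes in chronological order, excluding the match
    being predicted. Missing history yields None rather than a guess.
    """
    recent = results[-window:]
    win_ratio = sum(recent) / len(recent) if recent else None
    second_to_last = results[-2] if len(results) >= 2 else None
    return win_ratio, second_to_last


# Made-up history: the oldest result comes first.
history = [1, 0, 1, 1, 0, 1, 1]
ratio, penultimate = recent_form(history)
print(ratio, penultimate)
```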
Contrary to our initial thoughts, teams with more yellow cards in a World Cup tournament are more likely to win. This could, however, just indicate teams that got further in the tournament and so had more opportunity to receive yellow cards, or it could point to a more aggressive style of play that is associated with both yellow cards and wins.
Yellow cards received in World Cup series
Looking into the number of games played since 2004, Australia (226), New Zealand (218) and South Africa (211) have had the most matches. This correlates with the success of these nations, as they account for seven out of eight World Cup victories. This suggests that the more experience a side has, the greater the likelihood that it will win. This is further supported by the nations with less experience: for example, Namibia has played the fewest games since 2004, which correlates with its win percentage (see below).
Using the WPS Analytics workflow enabled the data scientists to collaborate by sharing the same workflow template, whilst applying different modeling approaches.
Improving Model Performance
Model tuning increased the model's predictive power by removing the variables with marginal contribution and tweaking the configuration parameters. The optimal model was identified by comparing the ROC curves and c-statistic in the Model Analyser; this helped to speed up the model assessment process.
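For readers unfamiliar with the c-statistic, it equals the area under the ROC curve and can be computed directly as the share of win/loss pairs that the model ranks correctly. The sketch below is a plain Python illustration with made-up labels and scores, not the Model Analyser's implementation.

```python
def c_statistic(labels, scores):
    """Share of (win, loss) pairs where the win has the higher score; ties count half."""
    wins = [s for y, s in zip(labels, scores) if y == 1]
    losses = [s for y, s in zip(labels, scores) if y == 0]
    if not wins or not losses:
        raise ValueError("need at least one win and one loss")
    concordant = 0.0
    for win_score in wins:
        for loss_score in losses:
            if win_score > loss_score:
                concordant += 1.0
            elif win_score == loss_score:
                concordant += 0.5
    return concordant / (len(wins) * len(losses))


# Made-up outcomes and predicted win probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
auc = c_statistic(labels, scores)
print(round(auc, 3))
```

A value of 0.5 means the model ranks wins and losses no better than chance, while 1.0 means every win was scored above every loss.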
The MLP, Decision Forest and Logistic Regression techniques all produced similarly predictive models.
Of the selected techniques, Logistic Regression can be converted into a Scorecard Model, which allocates scores to each predictive variable. In this use case, the ability to clearly present our model outweighs the additional accuracy of black box techniques such as MLP.
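The exact conversion used is not given here. One common convention, borrowed from credit scoring and assumed purely for illustration, scales each coefficient so that a fixed number of points doubles the odds of winning. The coefficients, bin labels and the `pdo` value of 20 below are all hypothetical.

```python
from math import log


def to_points(coefficients, pdo=20.0):
    """Convert log-odds coefficients per bin into integer scorecard points.

    `pdo` is the number of points that doubles the odds, so one unit of
    log-odds is worth pdo / ln(2) points.
    """
    factor = pdo / log(2)
    return {bin_name: round(factor * beta) for bin_name, beta in coefficients.items()}


# Hypothetical coefficients for one binned predictor.
coefficients = {
    "ranking_prev_year: 1-5": 0.9,
    "ranking_prev_year: 6-10": 0.2,
    "ranking_prev_year: 11+": -0.7,
}
points = to_points(coefficients)
print(points)
```

A team's total score is then the sum of the points for the bin it falls into on each predictor, which is what makes the drivers of a prediction easy to read off.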
For each model, a pool of predictors was verified using optimised grouping in the decision tree editor. The score should increase in the same direction as the grouping that improves the win likelihood. It is important to remove variables that do not follow this pattern, as they reduce the predictive power of the model.
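The direction check can be expressed as a simple test: when the groups of a variable are ordered by observed win rate, their scores should never decrease. The sketch below assumes each group is summarised as a (win rate, points) pair; the helper name and the figures are hypothetical.

```python
def scores_follow_win_rate(bins):
    """True if points never decrease as the observed win rate increases.

    `bins` is a list of (win_rate, points) pairs, one per group of a variable.
    """
    by_rate = sorted(bins, key=lambda b: b[0])
    points = [p for _, p in by_rate]
    return all(earlier <= later for earlier, later in zip(points, points[1:]))


# Made-up groupings: the second one scores a weaker group above a stronger one.
consistent = [(0.25, -20), (0.50, 6), (0.75, 26)]
inconsistent = [(0.25, 10), (0.50, -5), (0.75, 26)]
print(scores_follow_win_rate(consistent), scores_follow_win_rate(inconsistent))
```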
Our final model highlighted four predictors:
- Number of losses in the previous year
- Number of wins in the previous year
- Ranking in the previous year
- Win ratio of the team's last five matches with the current opponent
When looking at the scoring, it is clear that opponent and ranking make a great contribution to the model.
Data Driven Model Vs Rugby Expert
|Finalist||South Africa||New Zealand|
|Finalist||New Zealand||South Africa|
|Winner||New Zealand||South Africa|
The adoption of analytics and 'machine learning' is rapidly increasing. However, data itself cannot be leveraged unless a human can define the problem and interpret the insights to provide context for decision making.
We used the World Cup to demonstrate three approaches: data without context, domain knowledge without data points, and then a hybrid of the two. Below is Simon's feedback.
We gave Simon the initial view of the data science scorecard and asked him to comment. Although we had the same finalists, the scorecard showed unusual groupings with Romania, Georgia and Italy.
This was caused by the optimal binning algorithm we used, which binned opponents in an inconsistent way. As shown below, the countries are binned by number of wins whilst disregarding the tier of opponent played. If a team only plays weaker teams, this increases its win ratio but is not an accurate reflection of its strength.
|Opponent||Argentina, England, Fiji, Japan, Romania, Samoa, Wales||2|
|Australia, France, Georgia, Ireland, South Africa||-44|
|Canada, Scotland, Tonga||44|
|Italy, Russia, USA||77|
|Namibia, Portugal, Uruguay||126|
After consulting Simon, we took his advice and modified the model to include two more variables: one containing the tier of the team, and one noting the team's hemisphere. According to Simon, the team's tier is crucial in identifying the quality of the side. As seen in an earlier insight, nations may have a high win ratio but not be considered a top side; this is down to the teams they play.
A team's hemisphere was added as a variable because Simon believed there are differences in the culture of the game. Moreover, when a nation plays in the opposite hemisphere an adaptation is needed, and many teams struggle to make it.
As the graph illustrates, Georgia, with a win ratio of 49%, would be considered a strong team for the competition. This success, as identified by the domain expert, has come from mostly playing against tier two teams. Nations in tier one with a high win ratio would naturally be considered strong teams in the competition.
In contrast, Italy has a low win ratio as most of their games are played against tier one teams, but they could be considered a stronger team than Georgia. In order to fairly judge teams, we therefore need to distinguish between teams in the tiers. This demonstrates the importance of domain knowledge in data analysis.
Following Simon's advice, we added variables for tier and hemisphere, and decided to re-impute data using this new information.
Previous data imputation was used to estimate results for teams based on the aggregated median of their winning ratio against all teams. The new variables enabled us to tune the model, taking into account a team's winning ratio against teams in their own tier and hemisphere. This gave us a more accurate representation of how a team would perform against the opposition, adding 16 different segments to replace missing variables.
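The re-imputation can be pictured as replacing a missing win ratio with the median of the matching segment rather than the median across all teams. The sketch below is one possible reading, assuming a segment is a (tier, hemisphere) pair and that segments with no observed values fall back to the overall median; the rows and field names are made up.

```python
from collections import defaultdict
from statistics import median


def impute_by_segment(rows):
    """Fill missing 'win_ratio' values with the median of the row's segment.

    Falls back to the overall median when a segment has no observed values.
    Returns new rows and leaves the input untouched.
    """
    observed = defaultdict(list)
    for row in rows:
        if row["win_ratio"] is not None:
            observed[row["segment"]].append(row["win_ratio"])
    overall = median(v for values in observed.values() for v in values)
    filled = []
    for row in rows:
        value = row["win_ratio"]
        if value is None:
            segment_values = observed.get(row["segment"])
            value = median(segment_values) if segment_values else overall
        filled.append({**row, "win_ratio": value})
    return filled


# Hypothetical win ratios against opponents, segmented by (tier, hemisphere).
rows = [
    {"opponent": "Team A", "segment": ("tier 1", "south"), "win_ratio": 0.2},
    {"opponent": "Team B", "segment": ("tier 1", "south"), "win_ratio": 0.4},
    {"opponent": "Team C", "segment": ("tier 1", "south"), "win_ratio": None},
    {"opponent": "Team D", "segment": ("tier 2", "north"), "win_ratio": 0.8},
    {"opponent": "Team E", "segment": ("tier 2", "south"), "win_ratio": None},
]
filled = impute_by_segment(rows)
print([round(row["win_ratio"], 2) for row in filled])
```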
On reflection, we only had four days to work on the project. If we had had more time, we would have captured and incorporated more of Simon's feedback, as this was undeniably valuable. Data we would like to have added includes player physical statistics such as age, height, and weight. Simon discussed how data science is increasingly becoming a part of sport, which means there is more data on player and team game behavior, such as average time to get the ball out from the ruck, something at which New Zealand excel.
Just as the domain expert becomes an essential aid to data science, modelling can assist in minimising the confirmation bias frequently seen in sporting events, where fans become so emotionally invested they let their heart rule over their head.
Data science can achieve a lot on its own, but the real magic that makes it fit for purpose happens through successful collaboration with domain experts. The inputs received from Simon boosted our AUC on the test data from 0.84 to 0.89.
So, after all that, the question we originally set out to answer was ‘Who will win the 2019 Rugby World Cup?’ According to our model, the answer is England!
If you would like access to the dataset to build your own model, and to get a trial version of our software, please email email@example.com with the subject "Rugby World Cup"