Wednesday, June 11, 2014

World Cup Prediction Evaluation Competition

16 June UPDATE: This prediction evaluation exercise is now being updated at Sportingintelligence. See it there!

15 June UPDATE: I have updated the Infostrada predictions with a more recent version, thanks to Simon Gleave (@SimonGleave).

The World Cup starts tomorrow. Prognosticators have been hard at work generating their predictions of who will advance and who will win. But which prediction is the best? Is it the one that picks the winner? Or is it the one that best anticipates the knock-out round seedings? How can we tell?

In an ongoing exercise here at The Least Thing I am going to evaluate 10 different World Cup predictions. To do this I am going to quantify the "skill" of each forecast. It is important to understand that forecast evaluation can be done, literally, in an infinite number of ways. Methodological choices must be made and different approaches may lead to different results. Below I'll spell out the choices that I've made and provide links to all the data.

A first thing to understand is that skill is a technical term which refers to how much a forecast improves upon what is called a "naive baseline," another technical term. (I went into more detail on this at FiveThirtyEight earlier this spring.)  A naive baseline is essentially a simple prediction. For example, in forecast evaluation meteorologists use climatology as a naive baseline and mutual fund managers use the S&P 500 Index. The choice of which naive baseline to use can be the subject of debate, not least because it can set a low or a high bar for showing skill.
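To make the idea of skill concrete, here is a minimal sketch in Python. It assumes the score is simply the number of games predicted correctly, and it shows two ways of expressing skill: the raw margin over the baseline, and a normalized skill score of the kind commonly used in forecast verification. The function names and the tallies are hypothetical, chosen only for illustration; they are not results from this exercise.

```python
def skill_points(forecast_correct: int, baseline_correct: int) -> int:
    """Games the forecast got right beyond what the naive baseline got right."""
    return forecast_correct - baseline_correct


def skill_score(forecast_correct: int, baseline_correct: int, total_games: int) -> float:
    """Normalized skill: 1.0 is perfect, 0.0 ties the baseline, negative is worse."""
    if baseline_correct == total_games:
        raise ValueError("baseline is already perfect; skill score is undefined")
    return (forecast_correct - baseline_correct) / (total_games - baseline_correct)


# Hypothetical tallies (not real results): 48 group-stage games,
# the baseline gets 28 right and the forecast gets 33 right.
print(skill_points(33, 28))     # 5
print(skill_score(33, 28, 48))  # 0.25
```

A forecast that does worse than the baseline gets a negative value under either measure, which is what it means to show no skill.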

The naive baseline I have chosen to use in this exercise is the transfer market value of the 23-man World Cup teams from Transfermarkt.com. In an ideal world I would use the current club team salaries of each player in the tournament, but these just aren't publicly available. So I'm using the next best thing.

So for example, Lionel Messi, who plays his club team soccer at Barcelona and his national soccer for Argentina, is the world’s most valuable player. His rights have never been sold, as he has been with Barcelona since he was a child, yet he’s estimated to have a transfer market value of more than $200 million. By contrast all 23 men on the USA World Cup squad have a combined estimated value of $100 million. (I have all these data by player and team if you have any questions about them -- they are pretty interesting on their own.)

Here then are the estimated transfer values of each World Cup team:

Rank  Team                      Transfer value
1     Spain                     $1,044,960,000
2     Germany                     $944,160,000
3     Brazil                      $803,040,000
4     France                      $691,740,000
5     Argentina                   $657,720,000
6     Belgium                     $584,640,000
7     England                     $561,120,000
8     Italy                       $542,640,000
9     Portugal                    $517,860,000
10    Uruguay                     $364,476,000
11    Netherlands                 $348,600,000
12    Croatia                     $324,660,000
13    Colombia                    $318,931,200
14    Russia                      $308,784,000
15    Switzerland                 $299,040,000
16    Chile                       $234,864,000
17    Cote D'Ivoire               $202,389,600
18    Cameroon                    $198,072,000
19    Bosnia and Herzegovina      $192,780,000
20    Ghana                       $183,708,000
21    Japan                       $164,640,000
22    Mexico                      $152,964,000
23    Nigeria                     $145,908,000
24    Greece                      $134,232,000
25    Ecuador                     $105,588,000
26    United States of America     $97,104,000
27    Algeria                      $96,096,000
28    Korea Republic               $88,074,000
29    Costa Rica                   $49,980,000
30    Iran                         $41,076,000
31    Australia                    $36,204,000
32    Honduras                     $35,952,000

In using these numbers, my naive assumption is that a higher valued team will beat a lower valued team. As a method of forecasting that leaves a lot to be desired, obviously, as fans of Moneyball will no doubt understand. There is some evidence to suggest that, across sports leagues, soccer has the greatest chance for an underdog to win a match. So in principle, a forecaster using a more sophisticated method should be able to beat this naive baseline.
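The naive rule is simple enough to write down in a few lines. The sketch below applies it to one group, using squad values copied from the table above. The group membership (Group G of the 2014 draw) is not stated in this post, and the function names are hypothetical; it illustrates the rule and is not the spreadsheet actually used here.

```python
from itertools import combinations

# Squad values in USD, copied from the transfer-value table.
# Assumption: these four teams form Group G (taken from the 2014 draw).
GROUP_G = {
    "Germany": 944_160_000,
    "Portugal": 517_860_000,
    "Ghana": 183_708_000,
    "United States of America": 97_104_000,
}


def naive_pick(team_a, team_b, values):
    """Return the team with the higher squad value (ties are not handled)."""
    return team_a if values[team_a] > values[team_b] else team_b


def naive_group_forecast(values, n_advance=2):
    """Predict every round-robin match, then rank teams by predicted wins."""
    wins = {team: 0 for team in values}
    picks = {}
    for a, b in combinations(values, 2):
        winner = naive_pick(a, b, values)
        picks[(a, b)] = winner
        wins[winner] += 1
    ranking = sorted(values, key=lambda team: wins[team], reverse=True)
    return picks, ranking[:n_advance]


picks, advancing = naive_group_forecast(GROUP_G)
print(advancing)  # ['Germany', 'Portugal']
```

Because the rule never predicts a draw or an upset, the group ranking it produces is just the ranking by squad value.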

Here is what the naive baseline (based on rosters as of June 5) predicts for the Group Stages of the tournament: The final 4 will see Brazil vs. Germany and Spain vs. Argentina. Spain wins the tournament, beating most everyone’s favorite Brazil. The USA does not get out of the group stage, but England does. All 8 of the top valued teams make it into the final 8.

While this naive baseline is just logic and assumptions, work done by “Soccernomics” authors Stefan Szymanski and Simon Kuper indicates that a football team’s payroll tends to predict where it winds up every year in the league table. Payrolls aren't the same thing as transfer fees, of course, but they are related. Unfortunately, as mentioned above, individual player salaries are not available for most soccer leagues around the world (MLS is a notable exception).

I will be evaluating 10 predictions over the course of the World Cup. They are (with links to the data sources that I have used, as of 10 June unless noted otherwise):
The predictions are not all expressed apples to apples. So to place them on a comparable basis I have made the following choices:
  • A team with a higher probability of advancing from the group is assumed to beat a team with lower probability.
  • If no group stage advancement probability is given I use the probability of winning the overall tournament in the same manner.
  • This means that I have converted probabilities into deterministic forecasts. (There are of course far more sophisticated approaches to probabilistic forecast evaluation.) 
  • No draws are predicted, as no teams in the group stages have identical probabilities.
  • The units here, in the group stage at least, will simply be games predicted correctly. No weightings. 
Other choices could of course be made. These are designed to balance simplicity and transparency with a level playing field for the evaluation. Just as is the case with respect to the value of having a diversity of predictions, having a diversity of approaches to forecast evaluation would be instructive. No claim is made here that this is the only or best approach (laying the groundwork here for identifying eventual winners and losers).
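As a rough illustration of the choices listed above, the following Python sketch turns a set of probabilities into deterministic picks and counts the games predicted correctly, with no weighting. The team labels, probabilities and results are hypothetical, and treating an actual draw as a miss for every forecast is my assumption in this sketch, since no draws are predicted.

```python
def deterministic_pick(team_a, team_b, prob):
    """Pick the team with the higher probability (of advancing from the group,
    or of winning the tournament when no advancement probability is given)."""
    if prob[team_a] == prob[team_b]:
        raise ValueError("equal probabilities: no deterministic pick possible")
    return team_a if prob[team_a] > prob[team_b] else team_b


def games_correct(matches, prob):
    """Count matches where the pick equals the actual winner (unweighted).

    `matches` holds (team_a, team_b, actual_winner) tuples; actual_winner is
    None for a draw, which counts as a miss because no draws are predicted.
    """
    return sum(
        1 for a, b, winner in matches
        if deterministic_pick(a, b, prob) == winner
    )


# Hypothetical probabilities of advancing and hypothetical results.
advance_prob = {"A": 0.80, "B": 0.55, "C": 0.40, "D": 0.25}
results = [("A", "B", "A"), ("C", "D", "D"), ("A", "C", None), ("B", "D", "B")]
print(games_correct(results, advance_prob))  # 2
```

The same counting function can be run on the naive baseline by passing squad values in place of probabilities, which puts every forecast and the baseline on the same footing.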

With all that as background, below then are the predictions in one table. The yellow cells indicate the teams that the naive baseline sees advancing to the knockout stages, and the green shows the same for each of the 10 predictions. The numbers show the team rankings according to each prediction.
I will be tracking the performance of the 10 predictions against the naive baseline as the tournament unfolds, scoring them in a league table. I'll also discuss the methods and results as well as the sensitivity of the latter to the former. When the Group Stages wind up I'll reset for a second part of the prediction evaluation.

Finally, for now, I welcome any comments on this exercise. If there are other predictions that you'd like to track in the same manner alongside these, please enter them in the comments.

Let the game within the games begin! 

2 comments:

  1. Roger, I would love a breakdown copy of the player by player & Team Transfer data you mentioned you have.
    Awesome post. Will be following intently
    krozza49@gmail.com
    Craig
    AUSTRALIA

  2. If anyone wants the Betfair market odds as at a few hours before the first game kick-off, I'm happy to supply. I saved the odds on the eventual winner, also the odds on qualifying from each group.

    This is, in my capitalist opinion, the least naive forecast since it results solely from money flow! :)

    email me at my name at gmail dot com if you want them.

    Roddy Campbell

    Update following the Spain Holland game:

    At the start there was a 70% chance the winner would come from one of four teams, in order Brazil, Argentina, Spain and Germany.

    The impact of last night is that Spain have widened out a lot, nearly halving from 13.7% to 7.3%, Holland have joined them at 4th equal (rocketing in from a lowly 2% chance of winning the tournament before the game), and Argentina have tightened in.

    The Spanish betting move is unsurprising - a) they were rubbish, b) perhaps more importantly, their chance of qualifying has changed from 85% to 47%

    Holland are 96% to qualify, Chile 56%, ahead of Spain.

    (For those not familiar with Betfair, its huge advantage is that you can trade. eg someone who 'backed' Holland to win the tournament before the Spain game can now 'lay' Holland, and either take a cash win or continue to run some of the bet with the other guy's money. A perceived mispricing can be extracted - if I felt 50:1 on Holland was just wrong, too low, so I bought some of that, I can now trade out at 13:1. All numbers are round ones.)
