The 2018 FIFA World Cup starts tomorrow, and we decided to use Dataiku to predict not only who will win overall but also how far France, England, and Germany will go (we tried to include Italy too but could not make any predictions due to lack of data; we tried to include the United States too, but oh wait…).
Of course, you can find some pretty extensive predictions for the outcome of the 2018 FIFA World Cup online. But for this project, we wanted to show that with less than 24 hours to go before the start, we could build something with Dataiku that was efficient, easy, and quick.
And the Winner Is...
Our model predicted that Germany will win the 2018 FIFA World Cup against Spain. Surprised? Well if it happens, just remember: you read it here first.
Here’s a closer look at some specific teams’ results:
- France will finish first in group C despite drawing with Denmark. They’ll beat Croatia in the round of sixteen, but they’ll get knocked out by Spain in the quarter-final.
- England will also finish first in their group (G), undefeated. They’ll beat Poland in the round of sixteen but, like in 2002, will lose to Brazil.
- Spain will be second in group B, behind Portugal. They’ll defeat Russia, France, and Brazil, but they will be beaten by Germany.
- Germany will be first in group F. They’ll defeat Switzerland, Japan, Argentina, and Spain in the final.
Which would prove the old saying true, as always: “Football is a simple game. Twenty-two men chase a ball for 90 minutes and at the end, the Germans always win.”
Some counter-intuitive results we got:
- Iceland ended up last in group D in our predictions, which is surprising given their Euro 2016 results! It’s probably because, apart from the Euros, they have not played much in international competitions, so the sample is too small.
- Also, Japan is predicted to defeat Belgium in the round of sixteen (before meeting Germany in the quarter-final), which would be a shock given the relative strength of Belgium. So we’ll see about that.
Now that you’ve seen the outcome, read on to see how we got there.
Data Cleaning: Quicker than Gareth Bale
For the raw data, we used a dataset that gathers all recorded matches between national teams since 1880. This dataset contains:
- The score of each team
- Tournament style
- Game date and location
To keep it relevant, we only used results after 2000 (we were pretty sure no players in the 2018 FIFA World Cup had played on national teams before 2000 - although it turned out we were wrong).
From there, we got rid of each and every outlier - that is, matches with a spread of more than six goals between the two teams. That enabled us to forget Brazil-Germany in the 2014 FIFA World Cup semi-final.
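For the curious, here is a minimal sketch of those two filters in plain Python. It is not the actual Dataiku recipe we used: the `Match` record, the `clean_matches` function, and its `first_year` and `max_spread` parameters are hypothetical names, and whether each cutoff is inclusive is an assumption you can change.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Match:
    """One international match (hypothetical record layout)."""
    played_on: date
    home_team: str
    away_team: str
    home_score: int
    away_score: int


def clean_matches(matches, first_year=2000, max_spread=6):
    """Keep matches from first_year onward whose goal spread is at most max_spread.

    Both cutoffs are assumptions: the year is treated as inclusive, and a
    spread exactly equal to max_spread is kept.
    """
    kept = []
    for m in matches:
        if m.played_on.year < first_year:
            continue  # too old to be relevant
        if abs(m.home_score - m.away_score) > max_spread:
            continue  # treated as an outlier
        kept.append(m)
    return kept


sample = [
    Match(date(1998, 7, 12), "France", "Brazil", 3, 0),              # before 2000
    Match(date(2001, 4, 11), "Australia", "American Samoa", 31, 0),  # spread of 31
    Match(date(2016, 6, 27), "England", "Iceland", 1, 2),            # survives both filters
]
cleaned = clean_matches(sample)
```

Lowering `max_spread` makes the outlier filter stricter, so it is worth checking how many matches each setting throws away before training anything.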
Last but not least, we tried to keep the final results from being blurred by national soccer teams that do well in continental competitions but are rather weak on the worldwide scale (hello, Australia!). Yes, that’s a bias that would have led us to overweight European countries.
The second dataset we used contained all the 2018 World Cup matches, which let us sort teams into their groups.
Model Prediction: Smoother Than Messi’s Dribbles
To arrive at our predictions, we ran several algorithms on the cleaned dataset to predict each team’s score. Thanks to Dataiku’s quick model feature, we settled on the XGBoost algorithm.
We applied the XGBoost algorithm to each group match and ended up with real numbers, not integers. We of course had to make draws possible, so we decided that any gap between the two predicted scores smaller than a given interval would count as a draw (e.g., if France vs. Australia was 1.72-1.48, that would be a draw). We adjusted the interval’s length so that the proportion of draws in our predictions was similar to the proportion in the training dataset.
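The draw rule above can be sketched in a few lines of Python. This is an illustration under assumptions rather than our exact code: the function names, the grid search with its `step`, and the `target_rate` value are all hypothetical, and every prediction in the sample list except the France vs. Australia one is made up for the example.

```python
def outcome(score_a, score_b, margin):
    """Return 'A', 'B', or 'draw' from two predicted real-valued scores."""
    diff = score_a - score_b
    if abs(diff) <= margin:
        return "draw"  # scores too close to call
    return "A" if diff > 0 else "B"


def draw_rate(predictions, margin):
    """Share of predicted matches that count as draws for a given margin."""
    draws = sum(1 for a, b in predictions if outcome(a, b, margin) == "draw")
    return draws / len(predictions)


def calibrate_margin(predictions, target_rate, step=0.01, max_margin=3.0):
    """Smallest margin on a grid whose draw rate reaches target_rate.

    A simple grid search (an assumption, not necessarily how the interval
    was tuned in the project). Falls back to max_margin if never reached.
    """
    margin = 0.0
    while margin <= max_margin:
        if draw_rate(predictions, margin) >= target_rate:
            return margin
        margin = round(margin + step, 10)  # avoid float drift on the grid
    return max_margin


# Predicted (score_a, score_b) pairs; only the first comes from the text above.
predictions = [(1.72, 1.48), (2.10, 0.90), (1.05, 1.60), (1.30, 1.25)]

# target_rate would be the draw proportion measured on the training data;
# 0.5 here is just a placeholder for the example.
margin = calibrate_margin(predictions, target_rate=0.5)
results = [outcome(a, b, margin) for a, b in predictions]
```

A wider margin turns more close matches into draws, so tuning it against the historical draw rate keeps the predicted group tables from looking unrealistically decisive.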
Once we got the group results, we re-applied the same prediction algorithm to the knockout stages to find out who will be celebrating on July 15th (hint: not the Dutch team).
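To give an idea of the bookkeeping between the two stages, here is a rough Python sketch that turns predicted group outcomes into a table and picks a knockout winner. The function names and placeholder teams are hypothetical, and two choices are assumptions on our part: ties on points are broken alphabetically (the real competition uses goal difference and more), and a knockout match simply goes to the side with the higher predicted score.

```python
from collections import defaultdict


def group_table(results):
    """Rank teams by points: 3 for a win, 1 for a draw, 0 for a loss.

    results is a list of (team_a, team_b, outcome) tuples where outcome is
    'A', 'B', or 'draw'. Ties on points are broken by team name, which is
    a simplification.
    """
    points = defaultdict(int)
    for team_a, team_b, result in results:
        points[team_a] += 0  # make sure both teams appear in the table
        points[team_b] += 0
        if result == "draw":
            points[team_a] += 1
            points[team_b] += 1
        elif result == "A":
            points[team_a] += 3
        else:
            points[team_b] += 3
    return sorted(points.items(), key=lambda item: (-item[1], item[0]))


def knockout_winner(team_a, score_a, team_b, score_b):
    """No draws in the knockout stage: the higher predicted score goes through."""
    return team_a if score_a >= score_b else team_b


# Six matches of a four-team group, with placeholder teams and outcomes.
group_results = [
    ("T1", "T2", "A"), ("T3", "T4", "draw"),
    ("T1", "T3", "draw"), ("T2", "T4", "B"),
    ("T1", "T4", "A"), ("T2", "T3", "B"),
]
table = group_table(group_results)
qualified = [team for team, _ in table[:2]]  # top two go through
```

From there, the qualified teams are paired according to the official bracket and `knockout_winner` is applied round after round until only one team is left.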
Dataiku: The Zidane Every Data Team Needs (Minus the Head Butting)
Developing this project took us only a few hours but nevertheless let us map out the overall landscape of the 2018 FIFA World Cup. Dataiku allowed us to replicate workflows so that we did not have to recreate them every time we tried something new, which saved tons of time.
As I joined the company recently, I’m not yet a Dataiku expert. So I really loved being able to use Python code when I didn’t know exactly how the Dataiku visual interface options worked.
“The Ball is Round, the Game Lasts Ninety Minutes, and Everything Else Is Just Theory.”
As we’ve said before, this was a short and simple project, but it could definitely be improved by using FIFA player datasets to get finer granularity.
But more importantly, our predictions now have to stand up to real life. See you on the pitch (or, more probably, at the bar).
“I'm going to make a prediction - it could go either way.”
PS, if you want to give Dataiku a try for your own little project (FIFA or otherwise), you can download the free version here.