Do you know Kaggle and its slogan “making data science a sport”?
Kaggle is a nice platform for predictive modeling competitions, where the best data scientists face off against each other and fight to improve their models by 0.01 point of performance.
At Dataiku we love challenges, so we jumped into one of these contests: Blue Book for Bulldozers.
So here is what we did.
The goal of this contest was to predict the sale price of a bulldozer based on its model, age, description and a bunch of options.
Two datasets were released:
- the train set, with 400 000 auctions from 1989 to April 2012 (sale price known)
- the test set, with 27 000 auctions from May 2012 to November 2012, for which we had to predict the price.
After a short round of data exploration, we quickly understood that there were different categories of bulldozers with significant price variations. So our main idea was to use a random forest on each category of bulldozer.
A random what? Here is a great explanation of the random forest concept.
The first difficulty was the data itself: 53 features, with a lot of missing or erroneous values. Real-life data science... For example, some trucks were sold before their year of manufacture.
Others were sold more than once, so a machineID could appear many times in the train set, and the description could change between these different rows.
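Both problems are easy to spot with pandas. Here is a minimal sketch on toy data; the column names (`machine_id`, `year_made`, `sale_date`) are assumptions chosen for illustration, not necessarily the names used in the competition files.

```python
import pandas as pd

# Toy auctions; the column names are illustrative assumptions.
auctions = pd.DataFrame({
    "machine_id": [1, 1, 2, 3],
    "year_made": [1995, 1995, 2008, 2001],
    "sale_date": pd.to_datetime(
        ["2004-03-01", "2009-06-15", "2006-11-20", "2010-01-05"]
    ),
})

# Rows where the sale happened before the declared year of manufacture.
sold_before_built = auctions[
    auctions["sale_date"].dt.year < auctions["year_made"]
]

# Machines that appear in more than one auction.
sales_per_machine = auctions["machine_id"].value_counts()
resold_machines = sales_per_machine[sales_per_machine > 1].index.tolist()

print(sold_before_built)
print(resold_machines)
```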
Kaggle provided us with a machine appendix containing the “real” value of each feature for each machine, but it turned out that replacing the declared values with the true ones was not a good idea. Indeed, we think that each seller could choose whether or not to declare a characteristic on the auction website, and that this choice had an impact on the price.
The second difficulty was the volatility of some models. We spent a lot of time trying to understand how a machine could be sold twice in the same year, sometimes only a few days apart, at two completely different prices. It turned out that this was not easily predictable. In financial theory, the model used to describe this kind of randomness is called a random walk.
We tried a lot of things:
- we decomposed each option into new binary features.
- we added the age, computed from the sale date and the year of manufacture.
- we added the day of the week and the week number of the year.
- we also tried to add the number of auctions in the current month to capture the market trend.
- we tried to train our models on different periods, for example by removing the years 2009 and 2010, which were impacted by the economic crisis.
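Most of these features take only a line or two of pandas. The sketch below uses toy data; the column names (`sale_date`, `year_made`, `enclosure`) are assumptions, and `pd.get_dummies` is just one possible way to build binary option features.

```python
import pandas as pd

# Toy sales; the column names are illustrative assumptions.
sales = pd.DataFrame({
    "sale_date": pd.to_datetime(["2011-01-03", "2011-01-20", "2011-02-14"]),
    "year_made": [2005, 1998, 2008],
    "enclosure": ["EROPS", "OROPS", None],
})

# Age of the machine, in years, at the time of the sale.
sales["age"] = sales["sale_date"].dt.year - sales["year_made"]

# Calendar features: day of the week (Monday = 0) and ISO week number.
sales["day_of_week"] = sales["sale_date"].dt.dayofweek
sales["week_of_year"] = sales["sale_date"].dt.isocalendar().week.astype(int)

# Number of auctions held in the same calendar month.
sale_month = sales["sale_date"].dt.to_period("M")
sales["auctions_in_month"] = (
    sales.groupby(sale_month)["sale_date"].transform("count")
)

# One binary column per option value; a missing value gets its own column.
sales = pd.get_dummies(sales, columns=["enclosure"], dummy_na=True)

print(sales)
```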
A glimpse of our code:
Pandas makes it easy to filter rows, for example to keep only the years we needed:
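A minimal sketch of this kind of selection, on toy data with assumed column names (`sale_date`, `sale_price`). It illustrates the idea rather than reproducing the competition code.

```python
import pandas as pd

# Toy training set; the column names are illustrative assumptions.
train = pd.DataFrame({
    "sale_date": pd.to_datetime(
        ["2008-05-01", "2009-07-12", "2010-02-03", "2011-09-30"]
    ),
    "sale_price": [31000, 18500, 22000, 40500],
})

# Drop the crisis years 2009 and 2010, keep everything else.
sale_year = train["sale_date"].dt.year
train_subset = train[~sale_year.isin([2009, 2010])]

print(train_subset)
```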
We built one model per category:
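One way to do this is to group the rows by category, fit a separate forest on each group, and route each test row to the forest of its own category. In this sketch the column names (`category`, `age`, `sale_price`), the category labels and the hyperparameters are all assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data; column names and category labels are illustrative assumptions.
train = pd.DataFrame({
    "category": ["TEX", "TEX", "TEX", "WL", "WL", "WL"],
    "age": [2, 5, 9, 1, 4, 8],
    "sale_price": [60000, 45000, 30000, 90000, 70000, 50000],
})
test = pd.DataFrame({"category": ["WL", "TEX"], "age": [3, 6]})

features = ["age"]

# Fit one random forest per category of bulldozer.
models = {}
for category, rows in train.groupby("category"):
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    models[category] = model.fit(rows[features], rows["sale_price"])

# Predict each test row with the model trained on its own category.
test["predicted_price"] = 0.0
for category, rows in test.groupby("category"):
    test.loc[rows.index, "predicted_price"] = (
        models[category].predict(rows[features])
    )

print(test)
```

A category that appears in the test set but not in the train set would raise a `KeyError` here, so real data would need a fallback model.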
Scikit-learn provides a large set of machine learning models that are fast and simple to use. The next three lines show how we defined our model, trained it and got the prediction:
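As an illustration of the define, train and predict pattern, here is a sketch on toy data. The arrays and the hyperparameter values are placeholders, not the ones used in the competition.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: one feature, with a price that decreases as the feature grows.
X_train = np.array([[1], [2], [3], [4], [5], [6]])
y_train = np.array([90, 80, 70, 60, 50, 40])
X_test = np.array([[2.5], [5.5]])

model = RandomForestRegressor(n_estimators=50, random_state=0)  # define
model.fit(X_train, y_train)                                      # train
predictions = model.predict(X_test)                              # predict

print(predictions)
```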
We did a grid search to find the best parameters for the random forest. More information about the parameters of the random forest regressor.
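With scikit-learn, such a search can be sketched with `GridSearchCV`. The synthetic data, the parameter grid and the scoring choice below are assumptions made for the example, not the settings of our actual search.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, only there to make the example runnable.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=60)

# Candidate values for a few random forest parameters (illustrative).
param_grid = {
    "n_estimators": [20, 50],
    "max_features": [1, 2],
    "min_samples_leaf": [1, 5],
}

# Try every combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_params = search.best_params_

print(best_params)
```

After fitting, `search.best_estimator_` holds a forest refit on all the data with the best combination found.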
As a last step, we ran a post-processing pass: we kept the minimum and maximum price observed for each model (there are about 4000 different models of bulldozer in the data) and replaced our prediction with the nearest bound whenever it fell outside that range.
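This clipping step can be sketched with a group-by and `Series.clip`. The column names (`model_id`, `sale_price`, `predicted_price`) are assumptions, and leaving predictions untouched for models never seen in the train set is a choice made for this example.

```python
import pandas as pd

# Toy data; the column names are illustrative assumptions.
train = pd.DataFrame({
    "model_id": [10, 10, 10, 20, 20],
    "sale_price": [20000, 26000, 31000, 55000, 70000],
})
test = pd.DataFrame({
    "model_id": [10, 20, 30],
    "predicted_price": [35000.0, 50000.0, 12000.0],
})

# Minimum and maximum observed price for each bulldozer model.
bounds = train.groupby("model_id")["sale_price"].agg(low="min", high="max")
test = test.join(bounds, on="model_id")

# For models unseen in training, use the prediction itself as both bounds,
# so that clipping leaves it unchanged.
low = test["low"].fillna(test["predicted_price"])
high = test["high"].fillna(test["predicted_price"])

# Pull out-of-range predictions back to the nearest bound.
test["clipped_price"] = test["predicted_price"].clip(lower=low, upper=high)

print(test[["model_id", "predicted_price", "clipped_price"]])
```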
Finally, we were very happy to reach the 20th place on the final leaderboard (top 5%)!
So if you want to practice data science, we recommend trying Kaggle. It's fun and you’ll learn a lot. ;)