Perhaps you know Kaggle and its slogan “making data science a sport”?

Kaggle is a cool platform for predictive modeling competitions where the best data scientists face each other, all trying to improve their models' performance by 0.01 of a point.

At Dataiku, we love challenges, so we jumped at the chance to compete in one of these contests: Blue Book for Bulldozers.

So here is what we did.

The goal of this contest was to predict the sale price of a bulldozer based on its model, age, description, and a bunch of other options.

Two datasets were released:

• the train set, with 400 000 auctions from 1989 to April 2012 (sale price known)
• the test set, with 27 000 auctions from May 2012 to November 2012, for which we had to predict the price

We worked with Python and two cool libraries: scikit-learn for the machine learning algorithms, and pandas for data processing.

After a short round of data exploration, we quickly understood that there were different categories of bulldozers with significant price differences between them. So our main idea was to train one random forest per category of bulldozer.

A random what? Here's a great explanation of the random forest concept.

The first difficulty was the data itself: 53 features with a lot of missing or erroneous values. Real-life data science... For example, some machines were sold before their year of manufacture.

Others were sold more than once, so a machineID could appear many times in the train set, and the description could change between these different rows.
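Both kinds of oddities are easy to flag with pandas. Here is a minimal sketch on made-up rows; the column names `MachineID`, `YearMade` and `saledate` are assumptions for illustration and may not match the contest files exactly.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
auctions = pd.DataFrame({
    "MachineID": [1, 1, 2, 3],
    "YearMade": [2001, 2001, 2010, 1998],
    "saledate": pd.to_datetime(
        ["2005-03-01", "2008-06-15", "2009-01-20", "2011-11-02"]
    ),
})

# Rows where the sale year is earlier than the declared year of manufacture.
sold_before_made = auctions["saledate"].dt.year < auctions["YearMade"]

# Rows whose machine appears in more than one auction.
sold_more_than_once = auctions["MachineID"].duplicated(keep=False)

print(auctions[sold_before_made])
print(auctions[sold_more_than_once])
```

Flagging the rows rather than dropping them leaves the decision (fix, remove, or keep as is) for later.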

Kaggle provided us with a machine appendix containing the “real” value of each feature for each machine, but it turned out that substituting these true values was not a good idea. We think that each seller could choose whether to declare each characteristic on the auction website, and that what was declared had an impact on the price.

The second difficulty was the volatility of some models. We spent a lot of time trying to understand how a machine could be sold twice in the same year, sometimes with only a few days between the two sales, at two completely different prices. It turned out not to be easily predictable. In financial theory, the model used to describe this kind of randomness is called a random walk.

We tried a lot of things:

• we decomposed each option into new binary features
• we added the age from the sale date and the year of manufacture
• we added the day of the week, the number of the week in the year
• we also tried to add the number of auctions of the current month to try to capture the market tendency
• we tried to train our models on different periods, for example by removing the years 2009 and 2010, which were impacted by the economic crisis
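Several of the features listed above can be derived in a few lines of pandas. This is a minimal sketch on made-up rows, not our original code; the column names `saledate`, `YearMade` and `Enclosure` (standing in for one "option" column) are assumptions.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
auctions = pd.DataFrame({
    "saledate": pd.to_datetime(["2011-01-03", "2011-01-20", "2011-02-14"]),
    "YearMade": [2000, 2005, 1995],
    "Enclosure": ["OROPS", "EROPS", "OROPS"],
})

sale = auctions["saledate"]
features = pd.DataFrame({
    # Age of the machine at sale time, in years.
    "age": sale.dt.year - auctions["YearMade"],
    # Monday is 0, Sunday is 6.
    "day_of_week": sale.dt.dayofweek,
    # ISO week number within the year.
    "week_of_year": sale.dt.isocalendar().week.astype(int),
})

# Number of auctions held in the same calendar month as each sale.
month = sale.dt.to_period("M")
features["auctions_in_month"] = month.groupby(month).transform("count")

# One binary column per value of the option.
features = features.join(pd.get_dummies(auctions["Enclosure"], prefix="Enclosure"))
print(features)
```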

A glimpse of our code:

Pandas is very useful for selecting some values, for example to select only the years we needed:
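A minimal sketch of such a selection, assuming a `saledate` column parsed as dates and a `SalePrice` column (both names are assumptions, and the rows are made up). It drops the 2009 and 2010 auctions with a boolean mask:

```python
import pandas as pd

# Toy rows for illustration only; column names are assumptions.
auctions = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2008-05-01", "2009-07-12", "2010-03-30", "2011-09-09"]
    ),
    "SalePrice": [30000, 21000, 18500, 27000],
})

# Keep every auction except those held in 2009 or 2010.
sale_year = auctions["saledate"].dt.year
train = auctions[~sale_year.isin([2009, 2010])]
print(train)
```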

We built one model per category:
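One way to sketch this is a dictionary of models keyed by category, filled while iterating over a pandas `groupby`. The column names `ProductGroup`, `age` and `SalePrice`, the random toy data and the forest settings are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Random toy data for illustration only; column names are assumptions.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "ProductGroup": ["TTT", "WL"] * 20,
    "age": rng.integers(1, 20, size=40),
    "SalePrice": rng.integers(10000, 60000, size=40),
})
feature_cols = ["age"]

# Fit one random forest per category and keep it under the category name.
models = {}
for category, rows in train.groupby("ProductGroup"):
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(rows[feature_cols], rows["SalePrice"])
    models[category] = model

# At prediction time, each row is sent to the model of its own category.
example = pd.DataFrame({"age": [5]})
print(models["WL"].predict(example))
```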

Scikit-learn provides a large set of machine learning models that are fast and simple to use. It takes only three lines to define a model, train it and get the predictions:
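As a minimal sketch of that define, fit, predict pattern with scikit-learn's `RandomForestRegressor`: the synthetic arrays and the `n_estimators` value below are assumptions, not the settings we used in the contest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_train = rng.random((50, 3))
y_train = X_train @ np.array([3.0, 1.0, 0.5])
X_test = rng.random((5, 3))

# Define the model, train it, and get the predictions.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```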

We did a grid search to compute the best parameters of the random forest.
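A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The parameter grid, the number of folds, the scoring choice and the synthetic data below are assumptions for illustration, not the values from our actual search.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((60, 3))
y = X @ np.array([3.0, 1.0, 0.5])

# Hypothetical grid: every combination is scored by 3-fold cross-validation.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 3]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```

After fitting, `search.best_estimator_` holds a forest refit on all the data with the best combination found.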