Kaggle Contest: Blue Book For Bulldozers

By Matt Scordia

Perhaps you know Kaggle and its slogan, “making data science a sport”? Kaggle is a cool platform for predictive modeling competitions where the best data scientists go head to head, each trying to improve their model's performance by another 0.01 of a point.

At Dataiku, we love challenges, so we jumped at the chance to compete in one of these contests: the Blue Book for Bulldozers. Here is what we did.

Blue Book for Bulldozers: The Challenge

The goal of this contest was to predict the sale price of a bulldozer based on its model, age, description, and a bunch of other options.

Two datasets were released:

  • the train set, with 400,000 auctions from 1989 to April 2012 (sale price known)
  • the test set, with 27,000 auctions from May 2012 to November 2012, for which we had to predict the price

We worked with Python and two cool libraries: scikit-learn for the machine learning algorithms, and pandas for data processing.

After a short round of data exploration, we quickly understood that there were different categories of bulldozers with significant price variations between them. So our main idea was to train a random forest on each category of bulldozer.
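That gap shows up with a single groupby in pandas. Here is a small sketch (Train.csv, saledate and SalePrice are the file and column names we assume here; ProductGroup is the category column we split on later):

import pandas as pd

# load the training data (file and column names assumed from the competition files)
train = pd.read_csv('Train.csv', parse_dates=['saledate'])

# price distribution per product group: the spread between groups is large
print(train.groupby('ProductGroup')['SalePrice'].describe())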

A random what? 

The first difficulty was the data itself: 53 features with a lot of missing or erroneous values. Real-life data science... For example, some trucks were sold before their year of manufacture.

Others were sold more than once, so a MachineID could appear many times in the train set, and the description could change between these rows.
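Both issues are easy to spot with pandas (still a sketch, with saledate, YearMade and MachineID as the assumed column names):

# sales recorded before the machine's year of manufacture
sold_before_made = train[train['saledate'].dt.year < train['YearMade']]
print(len(sold_before_made), "sales before the year of manufacture")

# machines that appear several times in the training set
sales_per_machine = train['MachineID'].value_counts()
print((sales_per_machine > 1).sum(), "machines sold more than once")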

Kaggle provided us with a machine appendix containing the “real” value of each feature for each machine, but it turned out that substituting these true values was not a good idea. We think this is because each seller could choose whether or not to declare a machine's characteristics on the auction website, and that choice itself had an impact on the price.
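Using the appendix meant, roughly, joining it onto the auction rows and overriding the declared characteristics. The sketch below gives the idea (Machine_Appendix.csv and the way the overlapping columns are handled are assumptions, not the exact code we ran):

# join the "official" machine characteristics onto the auction rows
appendix = pd.read_csv('Machine_Appendix.csv')
train_true = train.merge(appendix, on='MachineID', how='left',
                         suffixes=('_declared', ''))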

As for the second point, we focused on the volatility of some models. We spent a lot of time trying to understand how a machine could be sold twice in the same year, sometimes only a few days apart, at two completely different prices. It turned out not to be easily predictable. In financial theory, the model used to describe this kind of randomness is called a random walk.
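The kind of query we kept running while digging into this looked roughly like the sketch below (SalePrice is the assumed name of the target column):

# price spread for machines sold more than once
per_machine = train.groupby('MachineID')['SalePrice'].agg(['count', 'min', 'max'])
repeat_sales = per_machine[per_machine['count'] > 1].copy()
repeat_sales['ratio'] = repeat_sales['max'] / repeat_sales['min']

# the machines whose price varies the most between two sales
print(repeat_sales.sort_values('ratio', ascending=False).head(10))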

Our Solution

We tried a lot of things (a few of them are sketched in code right after this list):

  • we decomposed each option into new binary features
  • we added the machine's age, computed from the sale date and the year of manufacture
  • we added the day of the week and the week number in the year
  • we also tried adding the number of auctions in the current month to capture the market tendency
  • we tried training our models on different periods, for example by removing the years 2009 and 2010, which were impacted by the economic crisis
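Here is a minimal sketch of what some of these derived features can look like in pandas (saledate, YearMade and Enclosure are column names we assume from the competition files; the exact code we used differed a bit):

# machine age at sale time
train['saledate'] = pd.to_datetime(train['saledate'])  # no-op if already parsed
train['age'] = train['saledate'].dt.year - train['YearMade']

# calendar features
train['day_of_week'] = train['saledate'].dt.dayofweek
train['week_of_year'] = train['saledate'].dt.isocalendar().week

# number of auctions in the current month, as a crude market indicator
month = train['saledate'].dt.to_period('M')
train['auctions_this_month'] = month.map(month.value_counts())

# binary decomposition of one of the option columns
train = pd.concat([train, pd.get_dummies(train['Enclosure'], prefix='Enclosure')], axis=1)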

A few insights into our code:

Pandas is very handy for selecting subsets of rows, for example to keep only the years we needed:

filtered = train[(train.YearMade <= 2008) | (train.YearMade > 2010)]

We built one model per category:

import numpy as np

category = np.unique(train['ProductGroup'])

for v in category:
    ind = (train['ProductGroup'] == v)
    # keep a separate view per category instead of overwriting the full dataset
    train_cat = train[ind]
    target_cat = target[ind]

Scikit-learn provides a large set of machine learning models that are fast and simple to use. The next three lines show how we defined our model, trained it, and got the predictions:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=150, compute_importances=True)
rf.fit(train, target)
predictions = rf.predict(test)

We did a grid search to find the best parameters for the random forest:

rf = RandomForestRegressor(n_jobs=8, compute_importances=True)
parameters = {
    'n_estimators': [50, 100, 150, 200, 300],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 10]
}
clf = GridSearchCV(rf, parameters)
clf.fit(train, target)  # the best combination ends up in clf.best_params_

More information about the parameters of the random forest regressor is available in the scikit-learn documentation. As a last step, we ran a post-processing pass: we kept the minimum and maximum sale price for each model (there are about 4,000 different models of bulldozer in the data) and replaced our prediction whenever it fell outside these bounds.
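In code, that post-processing boils down to clipping each prediction to the price range observed for its model in the training data; a sketch, using the raw test DataFrame and assuming fiModelDesc is the model column and SalePrice the target:

# price bounds per bulldozer model, observed in the training data
bounds = train.groupby('fiModelDesc')['SalePrice'].agg(['min', 'max'])

# bounds for each test row; models never seen in training fall back to the global range
model_min = test['fiModelDesc'].map(bounds['min']).fillna(train['SalePrice'].min())
model_max = test['fiModelDesc'].map(bounds['max']).fillna(train['SalePrice'].max())

# clip out-of-bounds predictions
predictions = pd.Series(predictions, index=test.index).clip(lower=model_min, upper=model_max)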

Finally, we were very happy to reach the 20th place on the final leaderboard (top 5%)!

If you want to practice data science, we recommend trying Kaggle as well. It's fun and you'll learn a lot! And while you're at it, why not download a free version of Dataiku and try to reproduce this project (or make it even better)?
