Here’s a brief description of a Dataiku marketer’s first Kaggle competition - and remember, this Dataiku marketer is me, and I'm no techy.
For those of you who already read my latest blog post (“My First Three Weeks as a Dataiku Marketer”), you already know that my very first interaction with the data world was the day I joined Dataiku and started the DSS tutorials. As I don’t speak Python or R (yet), I'm still only using DSS’s visual interface for my personal projects.
Three weeks ago, Dataiku announced the release of DSS V2. As you can imagine, the whole Dataiku team was super excited not only about the announcement but especially about using DSS V2. That's why on May 20th, Kenji, Dataiku’s Product Manager, gave the sales and marketing team a demo and presentation of DSS’s new look, new feel, and new functionalities. Within minutes of Kenji finishing his presentation, we decided to start testing DSS V2 for ourselves. And what better way to test a data science tool than by competing in a Kaggle competition? That's when we started the West Nile Kaggle challenge, whose goal is to predict the presence of West Nile virus in mosquitoes in the city of Chicago.
Playing with Data Science Studio
Gathered in a conference room, in a pleasantly competitive atmosphere, we began playing with DSS. I insist on the word playing because it really did feel like a game.
At first, I was a little scared. I had datasets, I had DSS. Ok... but I had to figure out what to do next. I already knew I wasn't competing for the top rank but I didn’t want to be the dumb kid in class either. I carefully read the Kaggle instructions, studied the datasets, and decided to go about it one step at a time. So here’s a brief description of a Dataiku marketer’s first Kaggle competition - and remember, this Dataiku marketer is me, and I'm no techy.
Step 1: Cleaning and Enriching
With DSS, it was really easy to import the datasets and to immediately start cleaning the data without a single line of code. The challenge offered multiple datasets but, for my first submission, I decided to use two of them:
- The first one was the "spray dataset": this dataset represents the areas that were sprayed in order to kill the mosquitoes and to eradicate the virus. I parsed the dates as suggested by DSS. Then, thanks to a DSS processor that creates a geopoint from latitude and longitude, I created a geopoint column.
- The second dataset was the "train dataset," which is also the main dataset: it shows the areas (with latitude and longitude) where the city of Chicago placed traps and how many mosquitoes were found in each trap. In this dataset, I repeated some of the preparation steps (parse the date and create a geopoint) that I’d used in the spray dataset. Finally, thanks to a suggestion by Marc, our CCO, I aggregated the lines that were "duplicated" when more than 50 mosquitoes were found in a trap.
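For readers who do speak Python, here is a minimal sketch of what those preparation steps could look like in code. It is not what DSS runs under the hood: the field names (Date, Trap, Latitude, Longitude, NumMosquitos, WnvPresent), the toy rows, the WKT-style "POINT(longitude latitude)" text, and the choice of grouping key are all assumptions made for illustration.

```python
from datetime import datetime

# Toy rows standing in for the train dataset (made-up values, assumed field names).
rows = [
    {"Date": "2007-05-29", "Trap": "T002", "Latitude": 41.95469,
     "Longitude": -87.800991, "NumMosquitos": 50, "WnvPresent": 0},
    {"Date": "2007-05-29", "Trap": "T002", "Latitude": 41.95469,
     "Longitude": -87.800991, "NumMosquitos": 12, "WnvPresent": 1},
    {"Date": "2007-05-29", "Trap": "T007", "Latitude": 41.994991,
     "Longitude": -87.769279, "NumMosquitos": 3, "WnvPresent": 0},
]


def prepare(row):
    """Parse the date string and add a geopoint built from latitude and longitude."""
    out = dict(row)
    out["Date"] = datetime.strptime(row["Date"], "%Y-%m-%d").date()
    # Assumed representation: a WKT-style point, longitude first.
    out["geopoint"] = f"POINT({row['Longitude']} {row['Latitude']})"
    return out


def aggregate_split_rows(rows):
    """Merge rows sharing date, trap, and location: sum the counts, keep any positive test."""
    merged = {}
    for row in rows:
        key = (row["Date"], row["Trap"], row["geopoint"])
        if key not in merged:
            merged[key] = dict(row)
        else:
            merged[key]["NumMosquitos"] += row["NumMosquitos"]
            merged[key]["WnvPresent"] = max(merged[key]["WnvPresent"], row["WnvPresent"])
    return list(merged.values())


prepared = aggregate_split_rows([prepare(r) for r in rows])
for row in prepared:
    print(row["Trap"], row["Date"], row["NumMosquitos"], row["WnvPresent"])
```

In this toy example, the two T002 lines collapse into a single line with their counts added together, which is the idea behind Marc's suggestion.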
Thanks to the geopoint column I’d created in both datasets, I used the DSS Join Recipe to join the spray and train datasets.
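As a rough code equivalent of that join, the sketch below matches rows on the geopoint value. It assumes an exact match on the geopoint text and a left join that keeps every trap row; the settings I actually used in the DSS Join Recipe may differ, and the helper name left_join, the SprayDate field, and the sample rows are hypothetical.

```python
# Toy rows (made-up values); "geopoint" is the shared column created during preparation.
train = [
    {"Trap": "T002", "geopoint": "POINT(-87.800991 41.95469)", "NumMosquitos": 62},
    {"Trap": "T007", "geopoint": "POINT(-87.769279 41.994991)", "NumMosquitos": 3},
]
spray = [
    {"geopoint": "POINT(-87.800991 41.95469)", "SprayDate": "2011-08-29"},
]


def left_join(left, right, key):
    """Keep every left row; attach matching right rows, or None values when nothing matches."""
    # Index the right-hand rows by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Columns that only exist on the right side, used to pad unmatched rows.
    right_columns = {col for row in right for col in row if col != key}

    joined = []
    for row in left:
        matches = index.get(row[key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            joined.append({**row, **{col: None for col in right_columns}})
    return joined


joined = left_join(train, spray, "geopoint")
for row in joined:
    print(row["Trap"], row["SprayDate"])
```

A left join is used here so that no trap line is lost when its location was never sprayed; those lines simply get an empty spray date.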
Step 2: Building a Model and Submitting a Prediction
Kurt, businessman by day and geek by night, proceeded to give me another helpful tip: identify and remove unused columns from the test dataset. I therefore decided to remove the addresses column because it wasn't in the dataset I used to train the model and because a model cannot be applied to information it has never been trained on.
DSS offers multiple options to build models and includes algorithms from the open source library scikit-learn. After testing a few algorithms including logistic regression, I noticed that the AUC was higher with Random Forest. Therefore, I chose to use a Random Forest model in order to predict the appearance of the West Nile virus. I trained the model on the new dataset (the result of joining the train and spray datasets I had previously cleaned).
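Since I picked the model based on AUC, here is a small sketch of what that number measures: the share of (positive, negative) pairs in which the model gives the positive example the higher score. The function below is a plain-Python illustration, not the code DSS uses, and the labels and scores for "model A" and "model B" are made up; they are not my actual results. scikit-learn exposes this metric as roc_auc_score.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 means the virus was present in the trap, 0 means it was not.
labels = [0, 0, 1, 0, 1]
model_a_scores = [0.1, 0.4, 0.35, 0.2, 0.8]  # made-up predictions
model_b_scores = [0.1, 0.3, 0.6, 0.2, 0.9]   # made-up predictions

print("model A AUC:", round(auc(labels, model_a_scores), 3))
print("model B AUC:", round(auc(labels, model_b_scores), 3))
```

An AUC of 0.5 is what random guessing would give and 1.0 means every positive is ranked above every negative, so the model with the higher AUC is the one that orders the traps better.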
Then, I proceeded to parse dates and to create a geopoint column on the test dataset. Finally, I applied the model to the new test dataset and submitted my prediction.
The final workflow looked like this:
And here is my rank. For a first submission, as expected, I’m not top-ranked, but I’m not the lowest-ranked either. Not going to lie: I’m proud of myself!
If you want to go further, I definitely suggest you read this blog post by Henri, a Dataiku Data Scientist, who was ranked 60th for his first submission (out of 411 teams).