As a marketer, I had quite a lot of experience using Excel but had never really run predictive models. Kaggle, a platform for predictive modeling competitions, provides a "Getting Started" competition, which was a great opportunity for me to level up my analyst skills. In this post, I will show you how I used Dataiku to explore the Titanic challenge dataset, an important first step toward making better predictions later.
* Note: This blog post contains screenshots from an older version of Dataiku, though all of the functionality described still exists. Watch the on-demand demo to take a look at the latest release of Dataiku in action.
The Kaggle Titanic challenge is about predicting survival on the Titanic. As you probably know, one of the reasons that the Titanic shipwreck led to such loss of life was that there were not enough lifeboats for everyone. Let's find out what sorts of people were more likely to survive.
The Kaggle website provides a training dataset containing the following attributes for 891 passengers (download the train.csv file); a quick way to load and inspect the same file in code is sketched after the list:
- Id: a unique number
- Survival: 1=yes, 0=no
- Passenger class: 1=Upper, 2=Middle or 3=Lower
- Name (examples: "Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")
- Sex: female/male
- Age
- Number of Siblings/Spouses Aboard
- Number of Parents/Children Aboard
- Ticket number
- Passenger Fare
- Cabin
- Port of Embarkation
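For readers who like to double-check things in code, here is a minimal pandas sketch that loads the same train.csv and prints its structure; none of it is needed for the Dataiku workflow described below. In the raw file, the columns are named PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.

```python
# Optional cross-check in pandas: load the Kaggle training file and inspect it.
# Assumes train.csv has been downloaded to the current working directory.
import pandas as pd

df = pd.read_csv("train.csv")

print(df.shape)    # (891, 12): 891 passengers, 12 columns
print(df.dtypes)   # the type pandas inferred for each column
print(df.head())   # first few rows, similar to Dataiku's Explore view
```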
Analyzing the Titanic Dataset in Dataiku
Importing the Dataset
Importing the dataset into Dataiku is pretty easy: a single drag-and-drop of the file is all it takes, and from there, Dataiku automatically guesses the charset and the other parsing parameters of the file (separator, etc.).
Once imported, we can view the data in a spreadsheet-like view (called Explore in Dataiku). At this point, it doesn't look very different from Excel. The biggest difference is that in Dataiku each column receives a type: a number, text, a date, etc., which lets us quickly see whether the data is clean. Here, we don't have any problems, as Kaggle provides a relatively well-structured dataset.
However, you can see that some columns have missing values (in Dataiku, each column shows a missing value indicator). For example, the Age column has roughly 20% missing values and the Cabin column about 77%.
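Continuing the optional pandas sketch, the same missing value percentages can be checked in one line:

```python
# Percentage of missing values per column, mirroring Dataiku's missing value indicator.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))   # Cabin ~77%, Age ~20%, Embarked ~0.2%, the rest 0
```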
First Analysis
We can dive into the analysis by looking at the proportion of surviving passengers, which is easy to get in Dataiku by opening the Analyze tool on the Survived column. It turns out that 342 of the 891 people in our dataset survived, which corresponds to a survival rate of 38.4% (not a lot...).
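In the pandas sketch, the same proportion comes straight from the Survived column:

```python
# Overall survival: 342 survivors out of 891 passengers, i.e. about 38.4%.
print(df["Survived"].value_counts())          # 0 (did not survive): 549, 1 (survived): 342
print(round(df["Survived"].mean() * 100, 1))  # 38.4
```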
By using the Dataiku Analyze functionality on the Age column as well, we can see a quick overview of the passenger age distribution with a histogram and a box plot.
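A rough code equivalent of that histogram and summary, using matplotlib on top of the earlier sketch:

```python
# Age distribution: numeric summary plus a simple histogram
# (a rough stand-in for Dataiku's Analyze histogram and box plot).
import matplotlib.pyplot as plt

print(df["Age"].describe())   # count 714 (missing ages excluded), median 28
df["Age"].plot(kind="hist", bins=20, title="Passenger age distribution")
plt.xlabel("Age")
plt.show()
```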
By repeating these steps on the other variables, I can get a good overall idea of what's in my dataset in just a few minutes (the pandas sketch after this list reproduces the same figures):
- The majority (64.8%) of the passengers are male.
- The majority are between 20 and 40 years old (the median is 28). Quite a high proportion is also very young: 44 of the 714 passengers whose age we know are under 6 years old.
- 24% of the passengers were in first class, 21% in middle class, and 55% in third class.
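Here is how those figures can be reproduced with the same pandas sketch:

```python
# Reproducing the quick overview above.
print(df["Sex"].value_counts(normalize=True).round(3))   # male ~0.648, female ~0.352
print(df["Age"].median())                                # 28.0

known_age = df[df["Age"].notna()]                        # 714 passengers with a known age
print(len(known_age), (known_age["Age"] < 6).sum())      # the post counts 44 of them under 6

print(df["Pclass"].value_counts(normalize=True).sort_index().round(2))  # 1: 0.24, 2: 0.21, 3: 0.55
```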
Data Visualization
After exploring a bit, we can switch to the Visualize view in Dataiku to build our first charts and go deeper into the analysis. Let's take a closer look at the relationship between age and survival. In a few clicks, using drag-and-drop functionality in Dataiku, we can plot the age of the passengers (split into 10 bins of 10 years each) against the average survival rate, then against the number of survivors. The third and fourth charts add a distinction between male and female survivors. With this quick visualization, it becomes clear that being a child or a woman was a clear advantage for survival.
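The same breakdown can be computed in the optional pandas sketch, binning Age into 10-year buckets with pd.cut:

```python
# Survival by age bin (10 bins of 10 years), overall and split by sex.
age_bin = pd.cut(df["Age"], bins=range(0, 101, 10), right=False)

print(df.groupby(age_bin)["Survived"].mean().round(2))            # survival rate per age bin
print(df.groupby(age_bin)["Survived"].sum())                      # number of survivors per age bin
print(df.groupby([age_bin, "Sex"])["Survived"].mean().round(2))   # survival rate per age bin and sex
```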
Similarly, we could make a chart showing the average survival rate by passenger class (a pandas equivalent is sketched after this list). We would find:
- In first class, 60% of the passengers survived.
- In middle class, 1 out of 2 passengers survived.
- In third class, it falls to 25%.
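For completeness, the per-class survival rates from the pandas sketch:

```python
# Average survival rate per passenger class (1 = Upper, 2 = Middle, 3 = Lower).
# Class 1 comes out well ahead of classes 2 and 3, matching the figures above.
print(df.groupby("Pclass")["Survived"].mean().round(2))
```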
Data Transformation
Now that we have a good overview of the dataset, we should see how we can enrich our data. This will give our predictive models more data to work with later.
In the Explore view, Dataiku provides a large variety of transformation tools that we call Processors. It is easy to:
- Delete or keep columns/rows depending on different values
- Transform, parse or replace some textual values
- Split cells
- Calculate dates, etc.
- "Braund, Mr. Owen Harris" -> Mr.
- "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" -> Mrs.
- "Heikkinen, Miss. Laina" -> Miss.
- "Master. Gosta Leonard" -> Master.
This extraction makes sense: it gives some extra information about the passenger, especially when the age is missing. Master refers to young boys and Miss to unmarried women and girls. This information will help in the predictive step later on.
Extracting the abbreviated title is easy in Dataiku thanks to a double split of the column (first on the comma, then on the period). Then, we can delete the extra columns that we no longer need.
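If you are following along in the pandas sketch instead, a rough equivalent of this double split is shown below (Title is an illustrative column name, not something from the original file):

```python
# Extract the title by splitting Name on the comma, then on the period,
# mimicking the double split done with Dataiku's processors.
df["Title"] = (
    df["Name"]
    .str.split(",").str[1]    # " Mr. Owen Harris"
    .str.split(".").str[0]    # " Mr"
    .str.strip()              # "Mr"
)
print(df["Title"].value_counts().head())   # Mr, Miss, Mrs, Master, ...
```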
That's it! At this point, we deploy the script in Dataiku to save the enriched dataset. We will use it to run our predictive model in part two of this blog post.
To conclude, we saw how easy it is to load a dataset into Dataiku and to get some initial insights from its different indicators. The data is well structured, so the only issue is some missing values. Through quick analyses and charts, we saw that being young or being a woman was a strong indicator of survival in the shipwreck. We will go deeper into the analysis in part two. Note that many other charts or transformations could have been done. The idea was to show one easy way to explore the data without any programming; the pandas snippets above are only an optional way to reproduce the same numbers in code.