I joined Dataiku a few weeks ago and have been really impressed by how many helpful features are easily accessible to anyone. As a marketer, I had quite a lot of experience with Excel but had never really run predictive models. With the Data Science Studio, I immediately leveled up in data analysis. Find out how I used the studio to predict survival from the sinking of the Titanic.
The Titanic challenge on Kaggle
Kaggle is a platform for predictive modelling competitions. It provides a "Getting Started" competition for gaining first experience in data science. This was a great opportunity for me to become a better analyst (a data scientist?). The challenge is about predicting survival on the Titanic.
In this post, I will show you how I used Dataiku's Data Science Studio to explore the problem. It is important to have a good overview of the dataset before modelling. In a second post, I will show how to predict survival with the Studio.
On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for everyone. Let's find out what sorts of people were more likely to survive.
The Kaggle website provides a training dataset containing a set of attributes for 891 passengers (download the train.csv file):
- Id: a unique number
- Survival: 1=yes, 0=no
- Passenger class: 1=Upper, 2=Middle or 3=Lower
- Name (examples: "Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")
- Sex: female/male
- Age
- Number of Siblings/Spouses Aboard
- Number of Parents/Children Aboard
- Ticket number
- Passenger Fare
- Cabin
- Port of Embarkation
Importing the dataset
Importing the dataset into the Data Science Studio is pretty easy: a single drag-and-drop of the file is all it takes. The studio automatically guesses the charset and other parameters of the file (comma-separated, etc.).
Once the data is imported, we can look at it in a spreadsheet-like view, which we call the Explore view. At this point, it seems that nothing has really changed from Excel.
In fact, each column is assigned a type: number, text, date, etc. This lets us see quickly whether the data is clean. Here, we don't have any problem (Kaggle provided a well-structured dataset).
However, you will notice that some columns have missing values. Each column has a missing-value indicator: the Age column has roughly 20% missing data and the Cabin column 77%.
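For readers who do like code, the same kind of indicator can be approximated in a few lines of Python. This is only a sketch and not what the Studio does internally: the four sample rows are made up for illustration (they are not the real Kaggle rows), and `missing_share` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical four-row excerpt in the layout of Kaggle's train.csv;
# empty cells stand for missing values.
SAMPLE = """PassengerId,Survived,Pclass,Sex,Age,Cabin
1,0,3,male,22,
2,1,1,female,38,C85
3,1,3,female,,
4,0,3,male,35,
"""

def missing_share(rows, column):
    """Return the fraction of rows whose value in `column` is empty."""
    values = [row[column].strip() for row in rows]
    return sum(1 for value in values if value == "") / len(values)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for column in ("Age", "Cabin"):
    print(f"{column}: {missing_share(rows, column):.0%} missing")
```

To run it on the real file, replace `io.StringIO(SAMPLE)` with an opened `train.csv`.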
We can start the analysis by looking at the proportion of passengers who survived. It is easy to get by opening the Analyse tool on the Survived column.
342 of the 891 people in our dataset survived, a survival rate of 38.4%. That is not a lot!
By using the Analyse tool on the Age column, we can get a quick overview of the distribution with a histogram and a boxplot.
Repeating these steps on other variables, I can get a first idea of my dataset within a minute:
- The majority (64.8%) of the passengers are male.
- Most are between 20 and 40; the median age is 28. A fairly high proportion are very young (44 of the 714 passengers whose age we know are under 6 years old).
- 24% are in the upper class, 21% in the middle class, 55% in the third class.
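The indicators above (category shares, median, a count below a threshold) are simple to reproduce by hand. Here is a minimal sketch in Python, assuming the column values have already been read into plain lists; the values are invented for illustration and `shares` is a hypothetical helper, so the printed numbers will not match the real dataset.

```python
from collections import Counter
from statistics import median

# Hypothetical values for illustration; None marks an unknown age.
sexes = ["male", "female", "male", "male", "female"]
ages = [22, 38, None, 35, 4]

def shares(values):
    """Map each distinct value to its share of the total."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}

# Statistics on age only make sense for passengers whose age is known.
known_ages = [age for age in ages if age is not None]
under_six = sum(1 for age in known_ages if age < 6)

print(shares(sexes))
print(median(known_ages))
print(f"{under_six} of {len(known_ages)} known ages are under 6")
```

Filtering out the unknown ages first mirrors the "714 passengers whose age we know" in the list above: missing values are left out of the denominator instead of being counted as zero.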
After exploring a bit, we can switch to the visualize view to build our first charts and go deeper in our analysis. Let's look at the relation between Age and Survival. In a few clicks and with drag and drop, we can plot the age of the passengers (split into 10 bins of 10 years) against the average survival rate, then against the number of survivors. The third and fourth graphs add a distinction between male and female among the survivors.
It becomes clear that being a child or a woman was a real advantage for survival.
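The charts described here boil down to a group-by: put each passenger in a 10-year age bin, then average the survival flag per bin, optionally split by sex. Below is a rough Python sketch of that computation, not the Studio's own implementation. The records are made up, and `age_bin` and `survival_rate_by` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical (age, sex, survived) records for illustration only.
passengers = [
    (4, "male", 1),
    (8, "female", 1),
    (22, "male", 0),
    (26, "female", 1),
    (29, "male", 0),
    (35, "female", 1),
    (54, "male", 0),
]

def age_bin(age, width=10):
    """Return the lower bound of the bin containing `age` (26 falls in bin 20)."""
    return int(age // width) * width

def survival_rate_by(records, key):
    """Average the survived flag over the groups defined by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: sum(flags) / len(flags) for group, flags in sorted(groups.items())}

by_age = survival_rate_by(passengers, lambda r: age_bin(r[0]))
by_age_and_sex = survival_rate_by(passengers, lambda r: (age_bin(r[0]), r[1]))

print(by_age)
print(by_age_and_sex)
```

Because the grouping key is passed in as a function, the same helper works for any other column, such as passenger class.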
Similarly, we could make a graph showing the average survival rate by passenger class. We would find:
- In the upper class, 60% of passengers survived.
- In the middle class, 1 out of 2 passengers survived.
- In the third class, it falls to 25%.
We will carry on our analysis in the modelling part.
Now that we have a good overview of the dataset, we should see whether we can enrich our data. This will benefit the predictive model that follows.
In the Explore view, the studio provides a large variety of transformation tools that we call Processors. It is easy to delete or keep columns/rows depending on different values, to transform, parse or replace some textual values, to split cells, to calculate dates, etc. I won't use all of them in our case.
To give a single example, we could extract the abbreviated form of the title in each name.
- "Braund, Mr. Owen Harris" -> Mr.
- "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" -> Mrs.
- "Heikkinen, Miss. Laina" -> Miss.
- "Palsson, Master. Gosta Leonard" -> Master.
This extraction makes sense: it gives some extra information on the passenger, especially when the age is missing. Master refers to boys, and Miss to girls and unmarried (often young) women. This information will help in the predictive step later on.
Extracting the abbreviated title is easy thanks to a double split of the column (first on the comma, then on the period). Then, we can delete the added columns that we do not need.
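For the curious, the double split can be written in a few lines of Python. This is a minimal sketch of the same idea, not the Studio's processor; `extract_title` is a hypothetical name, and the function assumes every name contains a period right after the title, as in the examples above.

```python
def extract_title(name):
    """Split on the first comma, then on the first period, and keep the title."""
    after_comma = name.split(",", 1)[-1]   # " Mr. Owen Harris"
    title = after_comma.split(".", 1)[0]   # " Mr"
    return title.strip() + "."

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
for name in names:
    print(name, "->", extract_title(name))
```

Taking the last piece of the comma split means a name without a surname part is still handled. A name with no period at all would come back whole, so real data may need an extra check.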
That's it! At this point, we save the dataset with our enrichment. We will use it for running our predictive model in the following blog post (part two).
To conclude, we saw how easy it is to load a dataset into the Data Science Studio and to get some initial information from its various indicators. The data is very well structured; the only problem is some missing data. From quick analysis and charts, we can hypothesize that being young or a woman mattered a lot for surviving the shipwreck. We will go deeper in the modelling part.
Note that many other charts or transformations could have been done. The idea is to show one easy way to explore the data without any programming. Leave a comment if you would have done something differently, or if you just want more details; I will be happy to answer.
Jeremy, a marketing guy learning Data Science.