As a marketer, I had plenty of experience with Excel but had never really run predictive models. Find out how I used the studio to predict survival from the sinking of the Titanic.
When I joined Dataiku I was really impressed by the helpful features easily accessible to anyone. With Dataiku Data Science Studio, I immediately leveled up in data analysis. My first big project was working on the dataset of the Titanic challenge on Kaggle.
A Great Start: the Titanic challenge on Kaggle
Kaggle is a platform for predictive modelling competitions. It offers a "Getting Started" competition, the Titanic challenge, for gaining first experience in data science. This was a great opportunity for me to become a better analyst (a future data scientist?). The challenge is about predicting survival on the Titanic.
On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for everyone. Let's find out what sorts of people were more likely to survive.
In this post, I will show you how I used Dataiku Data Science Studio to explore the problem. This is an important first step toward making better predictions. In a second post, I will show how to predict who survives the Titanic with DSS.
The Kaggle website provides a training dataset (download the train.csv file) that contains the following fields for 891 passengers:
- Id : a unique number
- Survival : 1=yes, 0=no
- Passenger class : 1=Upper, 2=Middle or 3=Lower
- Name (examples: "Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")
- Sex : female/male
- Age
- Number of Siblings/Spouses Aboard
- Number of Parents/Children Aboard
- Ticket number
- Passenger Fare
- Cabin
- Port of Embarkation
Analyzing the Titanic Dataset in Dataiku Data Science Studio
Importing the dataset
Importing the dataset in the Data Science Studio is pretty easy: a single drag-and-drop of the file is all it takes. The studio automatically detects the charset and other parameters of the file (comma-separated, etc.).
Once imported, we can visualize the data in a spreadsheet-like view, which we call the Explore view. At this point, it seems that nothing has really changed from Excel.
Actually, each column is assigned a type: it can be a number, text, a date, etc. This lets us see quickly whether the data is clean. Here, we don't have any problem (Kaggle provided a well-structured dataset).
However, some columns have missing values, which the missing-value indicator on each column makes easy to spot. The Age column is missing roughly 20% of its values, and the Cabin column 77%.
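For readers who want to reproduce the missing-value indicator outside the studio, here is a minimal Python sketch. It is only an illustration, not how DSS computes the indicator: the rows below are toy values invented for the example (not the real train.csv), and `missing_ratio` is a hypothetical helper name.

```python
import csv
import io

# Toy sample in the layout of train.csv. The values are invented for
# illustration; load the real file to get the 20% / 77% figures.
SAMPLE = """PassengerId,Age,Cabin
1,22,
2,38,C85
3,,
4,35,C123
5,,
"""

def missing_ratio(rows, column):
    """Share of rows whose value in `column` is empty."""
    missing = sum(1 for row in rows if not row[column].strip())
    return missing / len(rows)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for column in ("Age", "Cabin"):
    print(f"{column}: {missing_ratio(rows, column):.0%} missing")
```

To run it on the real data, replace `io.StringIO(SAMPLE)` with an open handle on train.csv.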
We can start the analysis by looking at the proportion of surviving passengers. We get it by opening the Analyse tool on the Survived column.
342 of the 891 people in our dataset survived, a survival rate of 38.4%. That is not a lot!
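The arithmetic behind that rate is simple enough to check by hand. The short Python sketch below uses only figures quoted in this post; the whole-ship comparison is my own addition and includes the crew, so it is not directly comparable to the passenger-only training set.

```python
# Figures quoted in this post.
train_survivors, train_total = 342, 891
ship_deaths, ship_total = 1502, 2224  # passengers and crew

train_rate = train_survivors / train_total
ship_rate = (ship_total - ship_deaths) / ship_total

print(f"Training set survival rate: {train_rate:.1%}")
print(f"Whole-ship survival rate:   {ship_rate:.1%}")
```

The training set comes out at 38.4%, against about 32.5% for everyone aboard.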
By using the Analyse tool on the Age column, we can have a quick overview of the distribution with a histogram and a boxplot.
Repeating these steps on other variables, I get a first idea of what's in my dataset within a minute:
- The majority (64.8%) of the passengers are male.
- Most passengers are between 20 and 40, with a median age of 28. A fairly high proportion is very young: 44 of the 714 passengers whose age we know are under 6 years old.
- 24% are in the upper class, 21% in the middle class, 55% in the third class.
After exploring a bit, we can switch to the Visualize view to build our first charts and go deeper in our analysis. Let's explore the relation between Age and Survival. In a few clicks, using drag and drop, we can plot the age of the passengers (split into 10 bins of 10 years) against the average survival rate, then against the number of survivors. The third and fourth charts add a distinction between male and female survivors.
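The binning behind the first chart can be sketched in a few lines of Python. This is an illustration under assumptions, not what the studio runs internally: the (age, survived) pairs are toy values invented for the example, and `survival_by_age_bin` is a hypothetical helper.

```python
from collections import defaultdict

# Toy (age, survived) pairs, invented for illustration only.
passengers = [(4, 1), (8, 1), (22, 0), (25, 1), (29, 0), (38, 1), (54, 0)]

def survival_by_age_bin(records, width=10):
    """Average survival rate per age bin of `width` years, keyed by the bin's lower bound."""
    bins = defaultdict(list)
    for age, survived in records:
        lower = int(age // width) * width
        bins[lower].append(survived)
    return {lower: sum(flags) / len(flags) for lower, flags in sorted(bins.items())}

rates = survival_by_age_bin(passengers)
for lower, rate in rates.items():
    print(f"{lower}-{lower + 10} years: {rate:.0%}")
```

Passengers with a missing age would have to be filtered out before calling the function, which is also why the age charts only cover the passengers whose age is known.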
It becomes clear that being a child or a woman was a real advantage for survival.
Similarly, we could make a graph showing the average survival rate by passenger class. We would find:
- In the upper class, 60% of passengers survived.
- In the middle class, 1 out of 2 passengers survived.
- In the third class, the rate falls to 25%.
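As a quick sanity check, the per-class rates can be combined with the class proportions found earlier to recover the overall survival rate. The Python sketch below uses only the rounded percentages quoted in this post, so the result is approximate.

```python
# Rounded figures quoted in this post: share of passengers and
# survival rate for each class.
class_share = {"upper": 0.24, "middle": 0.21, "third": 0.55}
class_survival = {"upper": 0.60, "middle": 0.50, "third": 0.25}

# Weighted average of the per-class survival rates.
overall = sum(class_share[c] * class_survival[c] for c in class_share)
print(f"Implied overall survival rate: {overall:.1%}")
```

The weighted average lands within a fraction of a percentage point of the 38.4% measured on the Survived column, which suggests the class breakdown is consistent with the rest of the exploration.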
We will carry on our analyses in the modeling part.
Now that we have a good overview of the dataset, we should see how we can enrich our data. This will give our predictive models more data to work with later.
In the Explore view, the studio provides a large variety of transformation tools that we call Processors. It is easy to delete or keep columns/rows depending on different values, to transform, parse or replace some textual values, to split cells, to calculate dates, etc. I won't use all of them in our case.
To give a single example, we could extract the abbreviated form of the title in each name.
- "Braund, Mr. Owen Harris" -> Mr.
- "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" -> Mrs.
- "Heikkinen, Miss. Laina" -> Miss.
- "Master. Gosta Leonard" -> Master.
This extraction makes sense: it gives some extra information on the passenger, especially when the age is missing. Master was used for boys, and Miss for girls and unmarried women. This information will help in the predictive step later on.
Extracting the abbreviated title is easy thanks to a double split of the column (first on the comma, then on the period). Then, we can delete the added columns that we do not need.
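For comparison, the same double split can be written in a few lines of Python. This is a sketch of the idea rather than the processor the studio applies, and `extract_title` is a hypothetical helper name.

```python
def extract_title(name):
    """Return the abbreviated title (e.g. 'Mr.') from 'Surname, Title. Given names'."""
    # First split: keep what follows the comma (or the whole string if there is none).
    after_comma = name.split(",", 1)[-1]
    # Second split: keep what precedes the first period.
    title = after_comma.split(".", 1)[0]
    return title.strip() + "."

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Master. Gosta Leonard",
]
titles = [extract_title(name) for name in names]
print(titles)
```

Splitting on the first comma and then on the first period keeps the function robust to given names that contain parentheses or additional periods.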
That's it! At this point, we deploy the script to save our dataset for the next part. We will use it to run our predictive model in the following blog post (part two).
To conclude, we saw how easy it is to load a dataset in the Data Science Studio and to get some initial information from its indicators. The data is very well structured; the only problem is some missing data. The quick analysis and charts suggest that being young or a woman greatly improved the chances of surviving the shipwreck. We will go deeper in the analysis later.
Note that many other charts or transformations are possible. The idea here is to show one easy way to explore the data without any programming. Get in touch with me if you would have done something differently, or if you just want more details. I will be happy to answer.
Have a go at the Titanic dataset yourself: download Dataiku Data Science Studio.