Running predictive models is pretty easy in the Data Science Studio. In a few clicks, you can predict a variable from the data you have. Once you have tried it, you will never see your data the same way, and you will want to try it on everything. Find out how I used the studio to predict survival in the sinking of the Titanic.
This blog post is the second in a series. In the previous post, I imported the training dataset containing information about 891 passengers of the Titanic. The idea is to find out what sorts of people were more likely to survive the shipwreck.
The exploration of the dataset with the Studio already gave us a good overview:
- 342 out of the 891 people in our dataset survived.
- Being a child or a woman was a clear advantage for survival.
- Being in an upper class was also an advantage for survival.
Below is a screenshot of the dataset we are going to use for the predictions. Remember that we have generated a new column with the title of each passenger.
What's a predictive model?
The next step is definitely the coolest part of our work. The goal is to predict the Survived variable (1=survived, 0=deceased) of a passenger. We want to find a model that gives the best predictions.
A variety of models can be used. They rely on techniques from statistics and machine learning. Basically, a model uses all the information available about the passengers to learn the characteristics of a survivor. Then, it outputs a decision (survived or not) for other passengers, which we have kept aside in order to test the model.
To select the best model, we score each one. Two common metrics are:
- Accuracy: proportion of correct predictions (total number of "good" predictions divided by the total number of predictions)
- AUC: the Area Under the ROC Curve (from 0.5 for a random model to 1 for a perfect model)
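For readers who like to see the arithmetic, here is a minimal Python sketch of both metrics. It is only an illustration and not how the studio computes them internally: the function names `accuracy` and `auc` are mine, the labels and scores are made up, and the AUC uses the pairwise definition (the chance that a random survivor gets a higher score than a random non-survivor).

```python
def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def auc(actual, scores):
    """Area under the ROC curve, computed pair by pair: the probability that
    a survivor (label 1) gets a higher score than a non-survivor (label 0).
    Ties count for one half."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and model scores, for illustration only.
actual = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2]

# Turn scores into 0/1 decisions with a 0.5 cut-off (an arbitrary choice).
predicted = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(actual, predicted))
print(auc(actual, scores))
```

Accuracy depends on the cut-off used to turn scores into decisions, while AUC only looks at how well the scores rank survivors above non-survivors.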
Before going further, let's illustrate what a model can be with a basic case. (If you already know what a predictive algorithm is, just skip this section.)
If we keep only the Survived, Age and Sex variables of our passengers, and rely on what we saw in the previous post (i.e. using our own common sense), we could create a simple decision tree to predict whether a passenger survives:
If we apply this algorithm to a sample of our dataset, we would get something like this:
| Age | Sex | Survived | Predicted | Result |
|---|---|---|---|---|
| 40 | male | 0 | 0 | good prediction |
| 18 | female | 0 | 1 | bad prediction |
| 24 | female | 1 | 1 | good prediction |
| 7 | female | 1 | 1 | good prediction |
| 47 | male | 1 | 0 | bad prediction |
Accuracy on this sample = 3/5 = 0.6
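The same exercise can be sketched in a few lines of Python. Be careful: the exact rules are my assumption (women survive, and so do boys under a hypothetical age limit of 16), chosen because they match the five rows above and what we saw during the exploration. The name `predict_survival` is just for illustration.

```python
# Hypothetical age limit under which a passenger counts as a child.
CHILD_AGE_LIMIT = 16


def predict_survival(age, sex):
    """Naive decision tree: women and children survive, other men do not.
    This sketch does not handle a missing age."""
    if sex == "female":
        return 1
    if age < CHILD_AGE_LIMIT:
        return 1
    return 0


# (age, sex, actual Survived value) for the five passengers of the sample.
sample = [
    (40, "male", 0),
    (18, "female", 0),
    (24, "female", 1),
    (7, "female", 1),
    (47, "male", 1),
]

predictions = [predict_survival(age, sex) for age, sex, _ in sample]
correct = sum(1 for (_, _, survived), p in zip(sample, predictions) if survived == p)
sample_accuracy = correct / len(sample)

print(predictions)      # [0, 1, 1, 1, 0]
print(sample_accuracy)  # 0.6
```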
It is a really basic and naive predictive model. We can expect a poor accuracy score if we run it on the full dataset. Plus, remember that about 20% of the values are missing for the Age variable. Because our algorithm doesn't handle this case, we already know that its accuracy cannot be better than 80%.
The Kaggle website actually published a tutorial to build a similar algorithm with Excel. Even though it is not a great way to run predictions in real life, it is a good exercise.
Running the first prediction
The Data Science Studio provides four predictive models for running a classification of this kind within our intuitive graphical interface: Logistic Regression, Random Forest, Support Vector Machine and Stochastic Gradient Descent. The studio will automatically choose the best parameters for you, which is great when you are a beginner like me.
Let's run our first predictions with the default settings and then explore what we get.
By default, three models were run with adapted settings: two with the Random Forest algorithm (with different parameters) and one with Logistic Regression. Let's explore the results of the second Random Forest model.
We get an accuracy score of 0.8324. It means that our model would correctly predict survival for 83% of our passengers (it was calculated on a different sample; there is more about that later in this post). It is quite a good result, isn't it?!
The studio automatically pre-processed the data before modelling:
- Some variables that make no sense for the prediction (such as the Name) were automatically set aside.
- It created dummy variables to measure the effect of each value of some variables (Sex and Title for instance).
- It imputed missing values for the numerical variables (Age).
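To make the last two steps more concrete, here is a small Python sketch of what dummy variables and imputation look like. It is not the studio's code: the helper names `impute_mean` and `dummify` are mine, the three rows are made up, and filling missing ages with the mean is an assumption (the studio may use another strategy).

```python
# Three made-up passengers; the third one has no known age.
rows = [
    {"Sex": "male", "Title": "Mr", "Age": 22.0},
    {"Sex": "female", "Title": "Mrs", "Age": 38.0},
    {"Sex": "female", "Title": "Miss", "Age": None},
]


def impute_mean(rows, column):
    """Replace missing values of a numerical column with the column mean."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = sum(known) / len(known)
    return [
        dict(r, **{column: mean if r[column] is None else r[column]})
        for r in rows
    ]


def dummify(rows, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    values = sorted({r[column] for r in rows})
    result = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = 1 if r[column] == value else 0
        result.append(new_row)
    return result


prepared = dummify(dummify(impute_mean(rows, "Age"), "Sex"), "Title")
for row in prepared:
    print(row)
```

After this step every column is numerical, which is what most algorithms need as input.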
The variable importance chart highlights the variables that the algorithm used most to decide whether each passenger survives or not. In our case, the most important elements are: male, Mr., passenger class, fare and Miss. It seems to reinforce our first analysis from the previous article.
The confusion matrix goes further in the understanding of the predictions. It compares the distribution of the actual survival values with that of the predicted values.
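As a rough illustration, here is how the four cells of a confusion matrix can be counted in Python. I reuse the five passengers from the naive decision tree example earlier in this post, not the output of the studio, and the function name `confusion_matrix` is my own.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination for a 0/1 target."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],   # survived, predicted survived
        "false_negative": counts[(1, 0)],  # survived, predicted deceased
        "false_positive": counts[(0, 1)],  # deceased, predicted survived
        "true_negative": counts[(0, 0)],   # deceased, predicted deceased
    }


# Actual and predicted values of the five-passenger sample.
actual = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)
```

The accuracy is the sum of the two "true" cells divided by the total, here (2 + 1) / 5 = 0.6.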
A bit of personalization
To get a better model, I decided to change some of the default parameters chosen by the studio. I dropped the PassengerId column (not meaningful at all for the prediction) and dummified the passenger class variable to measure the separate impact of each value.
One last thing you need to know: only 80% of the passengers in our dataset were used to train the model, and the remaining 20% were kept to test it. This technique is known as hold-out validation, the simplest form of cross-validation: we score the model on a different sample than the one used for training. It is good practice for detecting overfitting (one of the worst nightmares of a data scientist, I was told).
When you are done evaluating your model, you should retrain it on the full dataset to obtain the best model. Let's do that.
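Here is a minimal Python sketch of such an 80/20 split. The function name `split_train_test`, the seed and the use of plain passenger numbers are all assumptions for the example; the studio does this for you and may sample differently.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows, then keep a share of them aside for testing."""
    shuffled = list(rows)                  # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in for the 891 passengers of the training dataset.
passenger_ids = list(range(1, 892))

train, test = split_train_test(passenger_ids)
print(len(train), len(test))  # 713 178
```

The model only ever sees `train` while learning, so its score on `test` tells us how it behaves on passengers it has never met.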
First submission on Kaggle
As said in the previous post, the Titanic problem is part of a competition on Kaggle. Until now, we have used a dataset of 891 passengers for whom we know whether they survived or not. Kaggle provides another dataset of 418 other passengers without revealing whether they survived. The challenge is to run our model on this dataset, export the result to a csv file and then upload it to the website to get a score.
Let's download the test.csv file and upload it to the studio. In a few clicks, we apply the same modifications we made to the training dataset. Once done, we apply our model to the new dataset. A new column appears in the dataset: predicted values.
For these 418 passengers, we now have survival predictions. This is the moment you really feel the power of Data Science :-)
The last step is to export the results as a csv file and to upload it on Kaggle to get a score.
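If you ever need to build that file by hand, here is a small Python sketch. To my knowledge the competition expects two columns, PassengerId and Survived, but check the competition page to be sure. The function name `write_submission` is mine and the three predictions are made up.

```python
import csv
import io


def write_submission(predictions, output):
    """Write (PassengerId, Survived) pairs as csv to a file-like object."""
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, survived in predictions:
        writer.writerow([passenger_id, survived])


# Made-up predictions for three passengers, for illustration only.
predictions = [(892, 0), (893, 1), (894, 0)]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce a real file, pass `open("submission.csv", "w", newline="")` instead of the in-memory buffer.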
And the result is... 78% of our predictions were correct on this dataset. It is quite good!
Making our first predictions did not look too complicated, right? Leave a comment if something is unclear.
We could now work on getting a better predictive model, but that goes beyond the scope of this introduction.
Jeremy, a marketing guy learning Data Science.