XGBoost in Data Science Studio

Data Science | Machine Learning | Tutorial | Matthieu Scordia

XGBoost: you know this name if you're familiar with machine learning competitions. It's the algorithm you want to try: it's very fast, effective, easy to use, and comes with very cool features. If you're new to machine learning, check out this article on why algorithms are your friend. In this post, I'll explain how you can use XGBoost in DSS without coding, and some more advanced things you can do with Python code.

Since DSS 3.0, XGBoost has been natively integrated into DSS virtual machine learning, meaning you can train XGBoost models without writing any code.
You'll find more information about how to use XGBoost in visual machine learning in the reference documentation.

The rest of this post covers how to use XGBoost manually, for DSS 2.X.

XGBoost is a gradient boosting tree method. Gradient, because it uses gradient descent, a way to find a local minimum of a function: the algorithm follows the path of the descent. Boosting, because it builds on the idea that an ensemble of weak learners can be combined into a single strong learner. Have a look at Tianqi Chen's presentation to grasp the theoretical aspects.
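To make the idea concrete, here is a minimal numeric sketch of boosting on a toy regression problem (illustration only, not XGBoost itself): each round fits a "weak learner", here just a constant, to the current residuals, i.e. the negative gradient of the squared loss, and the ensemble prediction is the running sum of these small corrections.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])   # toy regression targets
pred = np.zeros_like(y)              # the ensemble starts at zero
learning_rate = 0.5

for step in range(30):
    residuals = y - pred             # negative gradient of the squared loss
    weak_learner = residuals.mean()  # the weakest possible learner: a constant
    pred += learning_rate * weak_learner

# pred converges toward the best constant prediction, the mean of y (6.0)
```

Real boosted trees replace the constant with a small decision tree fitted to the residuals, but the mechanics (fit the gradient, add a damped correction, repeat) are the same.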


XGBoost isn't in DSS by default: you have to get it from GitHub, compile it, and install the Python module into the Python environment of DSS.

In your server console:

git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
make -j4
/pathto/dss_home/bin/pip install -e /pathto/xgboost/python-package/

To check that everything works fine, you can try importing xgboost in an IPython notebook.

XGBoost in a custom model.

Take a dataset and, in your analysis, go to Models / Settings / Algorithms and add a custom model. Here you'll only have to write 3 lines of code: specify the model name, import xgboost, and create your classifier:

[Image: custom XGBoost model code in DSS]

  • XGBClassifier if your target is binary.
  • XGBRegressor if you're doing regression.

Let's compare it to scikit-learn's Gradient Boosting, both with default parameters:

[Image: comparison of XGBoost and scikit-learn gradient boosting]

Same R2 score, but XGBoost was trained in 20 seconds versus 5 minutes for the scikit-learn GBT!

You can now deploy it like any other model in DSS, but you'll probably want to change the default parameters to optimize your score!


You can set a number of parameters on your XGBoost model:

max_depth : int
    Maximum tree depth for base learners.
learning_rate : float
    Boosting learning rate (xgb's "eta")
n_estimators : int
    Number of boosted trees to fit.
silent : boolean
    Whether to print messages while running boosting.
objective : string
    Specify the learning task and the corresponding learning objective.
nthread : int
    Number of parallel threads used to run xgboost.
gamma : float
    Minimum loss reduction required to make a further partition 
    on a leaf node of the tree.
min_child_weight : int
    Minimum sum of instance weight(hessian) needed in a child.
max_delta_step : int
    Maximum delta step we allow each tree's weight estimation to be.
subsample : float
    Subsample ratio of the training instance.
colsample_bytree : float
    Subsample ratio of columns when constructing each tree.
base_score : float
    The initial prediction score of all instances, global bias.
seed : int
    Random number seed.
missing : float, optional
    Value in the data which needs to be present as a missing value. 
    If None, defaults to np.nan.

You have 2 ways to control overfitting in XGBoost:

  • Control the model complexity with max_depth, min_child_weight and gamma.
  • Add randomness to make training robust to noise with subsample and colsample_bytree.
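For example, you could start from values like these (purely illustrative, not tuned for any particular dataset):

```python
# Illustrative parameter values for taming overfitting.
params = {
    "max_depth": 6,            # shallower trees -> simpler model
    "min_child_weight": 5,     # require more evidence before splitting
    "gamma": 0.5,              # minimum loss reduction to make a split
    "subsample": 0.8,          # sample 80% of rows for each tree
    "colsample_bytree": 0.8,   # sample 80% of columns for each tree
}
# clf = xgb.XGBClassifier(**params)
```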

Sparse matrix.

XGBoost can take sparse matrices as input. That's very useful because when you have categorical variables with high cardinality, you can convert them into dummy matrices without running out of memory!

For this, I use a Python function:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def sparse_dummies(categorical_values):
    """Encode a categorical column as a sparse one-hot (dummy) matrix."""
    categories = pd.Categorical(categorical_values)
    N = len(categorical_values)
    row_numbers = np.arange(N, dtype=np.int64)
    ones = np.ones((N,))
    # One row per observation, one column per category level.
    return csr_matrix((ones, (row_numbers, categories.codes)))


This returns a sparse matrix with 3 columns, one per value of VAR_0001:

    <145231x3 sparse matrix of type '<type 'numpy.float64'>'
        with 145231 stored elements in Compressed Sparse Row format>

You can concatenate this matrix with other dummy matrices using scipy's hstack function:

from scipy.sparse import hstack
cat1 = sparse_dummies(df.VAR_0001)
cat2 = sparse_dummies(df.VAR_0002)
hstack((cat1,cat2), format="csr")
    <145231x7 sparse matrix of type '<type 'numpy.float64'>'
        with 290462 stored elements in Compressed Sparse Row format>
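Numeric columns can join the party too: turn them into sparse column vectors and stack everything into one design matrix. A small self-contained sketch with toy data (the 3x3 identity stands in for a real dummy matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

cat = csr_matrix(np.eye(3))                        # stand-in for a 3-value dummy matrix
num = csr_matrix(np.array([[1.5], [2.0], [0.5]]))  # a numeric column, made sparse
X = hstack((cat, num), format="csr")               # 3x4 sparse design matrix
```

The resulting csr matrix can be fed directly to XGBoost's fit method.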

Early stopping.

A really cool feature is early stopping. As you learn more and more trees, you will overfit your training dataset. Early stopping lets you specify a validation dataset and a number of rounds: if the score on the validation dataset doesn't improve for that many iterations, the algorithm stops.

To use it, specify in the fit method of the classifier an evaluation set, an evaluation metric and the number of early stopping rounds:

clf = xgb.XGBClassifier(n_estimators=10000)
eval_set  = [(train,y_train), (valid,y_valid)]
clf.fit(train, y_train, eval_set=eval_set, 
        eval_metric="auc", early_stopping_rounds=30)

I explicitly set n_estimators to a very large number. In your job log you'll see the score increasing on the datasets you put in the eval_set list:

    Will train until validation_1 error hasn't decreased in 30 rounds.
    [0]    validation_0-auc:0.733451   validation_1-auc:0.698659
    [1]    validation_0-auc:0.776699   validation_1-auc:0.731099
    [2]    validation_0-auc:0.789156   validation_1-auc:0.740601
    [3]    validation_0-auc:0.792534   validation_1-auc:0.744378
    [4]    validation_0-auc:0.800747   validation_1-auc:0.748260
    [5]    validation_0-auc:0.805586   validation_1-auc:0.750209
    [6]    validation_0-auc:0.810889   validation_1-auc:0.752157
    [7]    validation_0-auc:0.812459   validation_1-auc:0.752554
    [8]    validation_0-auc:0.812928   validation_1-auc:0.752733
    [9]    validation_0-auc:0.813815   validation_1-auc:0.753650
    [10]   validation_0-auc:0.814547   validation_1-auc:0.753750
    ...
    [271]  validation_0-auc:0.897922   validation_1-auc:0.782187
    [272]  validation_0-auc:0.898150   validation_1-auc:0.782179
    [273]  validation_0-auc:0.898150   validation_1-auc:0.782179
    [274]  validation_0-auc:0.898439   validation_1-auc:0.782225
    [275]  validation_0-auc:0.898439   validation_1-auc:0.782225
    [276]  validation_0-auc:0.898591   validation_1-auc:0.782219
    Stopping. Best iteration:
    [246]  validation_0-auc:0.894087   validation_1-auc:0.782487

Note that you can define your own evaluation metric instead.
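In the xgboost versions this post targets, a custom metric for the scikit-learn wrapper is a function taking the predictions and the evaluation DMatrix and returning a (name, value) pair; check the signature for your xgboost version, since newer releases changed it. A sketch:

```python
import numpy as np

def error_rate(preds, dtrain):
    # Custom eval metric: fraction of misclassified examples at threshold 0.5.
    # Signature (preds, DMatrix) -> (name, value), as used by older xgboost
    # versions; newer versions may expect a different signature.
    labels = dtrain.get_label()
    return "error", float(np.sum((preds > 0.5) != labels)) / len(labels)

# then: clf.fit(train, y_train, eval_set=eval_set, eval_metric=error_rate, ...)
```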

Features importance.

You can easily get the feature importances from clf.booster().get_fscore(), where clf is your trained classifier.

In an IPython notebook, I use this code to plot them:

import pandas as pd

features = [ "your list of features ..." ]
# Map xgboost's internal names (f0, f1, ...) back to real feature names.
mapFeat = dict(zip(["f"+str(i) for i in range(len(features))], features))
ts = pd.Series(clf.booster().get_fscore())
ts.index = ts.reset_index()['index'].map(mapFeat)
ts.sort_values()[-15:].plot(kind="barh", title="features importance")

[Image: features importance plot]

Hyperopt for grid search.

Fine-tuning your XGBoost model can be done by exploring the space of possible parameters. For this task, I use the Hyperopt package: a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.

Here's an example of a Python recipe using it:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from sklearn.metrics import roc_auc_score
import xgboost as xgb
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials 

train = dataiku.Dataset("train").get_dataframe()
valid = dataiku.Dataset("valid").get_dataframe()

y_train = train.target
y_valid = valid.target

del train["target"]
del valid["target"]

def objective(space):

    clf = xgb.XGBClassifier(n_estimators = 10000,
                            max_depth = int(space['max_depth']),
                            min_child_weight = space['min_child_weight'],
                            subsample = space['subsample'])

    eval_set  = [( train, y_train), ( valid, y_valid)]

    clf.fit(train, y_train,
            eval_set=eval_set, eval_metric="auc",
            early_stopping_rounds=30)

    pred = clf.predict_proba(valid)[:,1]
    auc = roc_auc_score(y_valid, pred)
    print "SCORE:", auc

    return {'loss': 1 - auc, 'status': STATUS_OK}

space = {
        'max_depth': hp.quniform("x_max_depth", 5, 30, 1),
        'min_child_weight': hp.quniform('x_min_child', 1, 10, 1),
        'subsample': hp.uniform('x_subsample', 0.8, 1)
}

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)

print best

After loading the training and validation datasets, I define my objective function. This function trains a model, evaluates it, and returns the error on the validation set.

I then define the space I want to explore: here I try values from 5 to 30 for max_depth, from 1 to 10 for min_child_weight, and from 0.8 to 1 for subsample.

Hyperopt will minimize this error in a maximum of 100 experiments.

Here's more documentation: hyperopt

That's it! If you want to talk more about xgboost or anything related to data science, send me an email!
