Predict Breast Cancer Metastasis With Dataiku

The past decade has seen a dramatic fall in the price of sequencing a human genome, and there are now amazing open databases filled with genomic information, so anyone can access terabytes of genomic data. We'll use this data to build a predictive model of breast cancer metastasis with Dataiku DSS.

I'm going to look at one of these databases: The Cancer Genome Atlas. Specifically, I'm going to look at the mutations that happen in breast cancer. I'm going to do two specific things with this data. First, I'm going to run a clustering algorithm on the cancers to see if there are different types of breast cancer. Then, I'm going to build a predictive model for cancer recurrence, otherwise known as metastasis.

Note: This blog post contains screenshots from an older version of Dataiku, though all of the functionality described still exists. Watch the on-demand demo to take a look at the latest release of Dataiku in action.

Data Preparation

The data that I'm working with comes in two files. I have a .csv file with the 200 most common mutations that occur in breast cancer. Each row in the dataset is a patient and each column is a mutation, with "true" or "false" indicating whether the mutation is present. I also have clinical information on each of the patients: their treatments, when they were diagnosed, their outcomes, etc.
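To make the layout of the mutation file concrete, here is a minimal sketch that parses a file of that shape with Python's standard library. The column names (`patient_id`, `TP53`, `PIK3CA`) and the three sample rows are made up for illustration; the real file's header is not shown in this post.

```python
import csv
import io

# Hypothetical three-patient excerpt; the real file has about 200 mutation columns.
raw = """patient_id,TP53,PIK3CA
P1,true,false
P2,false,true
P3,false,false
"""

def load_mutations(text, id_column="patient_id"):
    """Map each patient ID to a dict of {mutation name: bool}."""
    rows = {}
    for record in csv.DictReader(io.StringIO(text)):
        patient = record.pop(id_column)
        # The file stores "true"/"false" strings; convert them to real booleans.
        rows[patient] = {
            gene: value.strip().lower() == "true"
            for gene, value in record.items()
        }
    return rows

mutations = load_mutations(raw)
```

With real data, you would pass the contents of the .csv file instead of the inline string.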

Clustering Cancers Based on Genetics

First, I wanted to cluster the cancers based on their mutations to see if there are genetically distinct types of cancers. Doing this is super easy with Dataiku. I clicked on the common_mutations dataset, which brings up the data explorer. Then, I clicked on Models.

Right now, I just want to explore the dataset, so I chose the clustering option.

Dataiku automatically chooses an appropriate clustering algorithm. For this dataset, it went with k-means clustering, which is a very widely used method. You can go into the settings to choose from several other algorithms.
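For readers curious about what k-means does with true/false mutation data, here is a minimal sketch in plain Python. It is not Dataiku's implementation: the toy data, the choice of three clusters, and the deterministic farthest-point initialization are all assumptions made for illustration (library implementations typically use randomized seeding such as k-means++).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Deterministic farthest-point seeding (a simplification, see above)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each patient goes to the nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy patients encoded as [TP53, PIK3CA] with 1 = mutated, 0 = not mutated.
points = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
labels, centroids = kmeans(points, k=3)
```

Because the columns are 0/1, each centroid coordinate is simply the fraction of patients in that cluster who carry the mutation, which is why cluster summaries for this kind of data are easy to read.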

Dataiku also produces a summary of each of the clusters, which is really interesting for this particular dataset. We see three large clusters: cluster_0, cluster_2, and cluster_4. Clicking on cluster_0 brings up a summary.

We see that cancers in this cluster are almost all positive for TP53 mutations, while almost all of them are negative for PIK3CA mutations. Mutations in both genes are known to drive cancer: TP53 is a tumor suppressor gene, and PIK3CA is an oncogene.
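A cluster summary like this boils down to the fraction of patients in each cluster who carry each mutation. Here is a minimal sketch of that calculation in plain Python; the rows are invented toy data, and Dataiku's own summary may be computed differently.

```python
from collections import defaultdict

# Hypothetical (cluster label, mutation flags) pairs, one per patient.
rows = [
    ("cluster_0", {"TP53": True, "PIK3CA": False}),
    ("cluster_0", {"TP53": True, "PIK3CA": False}),
    ("cluster_2", {"TP53": False, "PIK3CA": True}),
    ("cluster_2", {"TP53": False, "PIK3CA": True}),
    ("cluster_4", {"TP53": False, "PIK3CA": False}),
]

def mutation_rates(labelled_rows):
    """Return {cluster: {gene: fraction of patients with the mutation}}."""
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for cluster, mutations in labelled_rows:
        sizes[cluster] += 1
        for gene, present in mutations.items():
            counts[cluster][gene] += int(present)  # adds 0 or 1
    return {
        cluster: {gene: counts[cluster][gene] / sizes[cluster] for gene in counts[cluster]}
        for cluster in sizes
    }

rates = mutation_rates(rows)
```

A rate near 1.0 means almost every cancer in the cluster has the mutation, and a rate near 0.0 means almost none do.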

In contrast, we can look at cluster_2:

These cancers are almost all negative for TP53 mutations, while they are almost all positive for PIK3CA mutations. This is the exact opposite of cluster_0!

Finally, we can look at cluster_4:

This cluster is negative for both TP53 and PIK3CA mutations. These are the remaining cancers that don't fit into the two other categories. So, with just a few clicks, we were able to run a clustering analysis on hundreds of cancers. We found three main types of cancer: those with TP53 mutations, those with PIK3CA mutations, and then a group of remaining cancers. Now, let's do some predictive modeling!

Predictive Modeling of Metastasis

I'm most interested in building a predictive model of cancer recurrence. In general, having a recurrence is really bad for a cancer patient, so clinicians are often most interested in predicting this variable.

In order to build this model, I need to join the table containing the mutations with the table containing the patients' clinical information. Fortunately, joins are really easy with Dataiku. To do a join, I click on the mutation dataset and then the join widget on the right of the screen. This brings up a dialog box where I can choose the second dataset.

Then, Dataiku asks you to specify the column that pairs rows in one dataset with rows in the other dataset. In this case, it was the identifier for each patient.
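Conceptually, the join matches rows by that key column. Here is a minimal sketch in plain Python, assuming an inner join (the post doesn't say which join type was used) and hypothetical column names `patient_id` and `recurrence`; the rows are toy data, not values from the real files.

```python
# Hypothetical toy rows; the real column names are not given in this post.
mutations = [
    {"patient_id": "P1", "TP53": True, "PIK3CA": False},
    {"patient_id": "P2", "TP53": False, "PIK3CA": True},
    {"patient_id": "P3", "TP53": False, "PIK3CA": False},
]
clinical = [
    {"patient_id": "P1", "recurrence": True},
    {"patient_id": "P2", "recurrence": False},
    {"patient_id": "P4", "recurrence": False},
]

def inner_join(left, right, key):
    """Merge each left row with the right row that shares its key value."""
    # Index the right table by key; assumes the key is unique in `right`.
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:  # drop patients missing from the right table
            joined.append({**row, **match})
    return joined

model_table = inner_join(mutations, clinical, "patient_id")
```

Here P3 and P4 drop out because each appears in only one table. Each remaining row holds both the mutation flags and the outcome, which is the shape a predictive model needs.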