Easy Text Clustering

data science| tutorial| clustering | | Jeremy Greze

Working on text-based datasets is a different world to dealing with numbers. Comparisons between words are much harder and it may be difficult to group or aggregate similar values. Let's find out more about how to deal with this.

A textual input in your database is probably composed of a sequence of words. If you want to compare two inputs to check how close they are, it is difficult.

  • Words generally have different forms or inflections. A noun can be inflected for a number (singular or plural with the "s"), a verb for a tense (past tense with -ed), etc.
  • Not all words give the same amount of information. Typically, some short function words that we call "stop words" ("a", "the", etc.) are generally meaningless when making phrase comparisons.
  • Misspelled words add a difficulty (read the list of the 100 most often misspelt words in English).

The new Yuzu version of Data Science Studio integrates tools to help you with natural language processing. Today, I present you two tips that I often use.

The AOL dataset

To illustrate this article, I downloaded a publicly available dataset of web queries collected on AOL Search in 2006 (as of 2016, this dataset can no longer be downloaded from infochimps). From the 20M search queries, I extracted 300,000 queries to create the dataset we are going to use.

Each row is a search query issued by an user. The columns are :

  • AnonID: an anonymous user ID;
  • Query: the query -> what we are going to work on;
  • QueryTime: the time at which the query was submitted;
  • ItemRan: the rank of the item the user clicked on;
  • ClickURL: the domain of the link the user clicked on.

This is a really interesting dataset because the queries come from a real case and with misspelling.

Learn more about the dataset in the README file provided.

Let's have a look at the data. We have around 300,000 rows. We see that a text transformation has already been applied to the dataset: all queries are in lower case.

The AOL dataset

The top queries are:

Top queries

Text normalization

The first transformation we are going to do is to simplify the text of the queries. It will help us to compare and regroup similar queries.

Clicking on the "add script" in the interface, we choose the "simplify text" processor. Then we change the settings to transform the 'Query' column into a new column with the following actions:

  • Normalizing text: to transform to lowercase, remove accents and perform Unicode normalization.
  • Stemming words: to transform each word into its 'stem', ie. its grammatical root. For example, 'grammatical' is transformed to 'grammat'.
  • Clearing stop words: to remove so-called 'stop words' (the, I, a, of, etc.).

Text normalization

Top queries are now:

Top Queries Now

Text clustering

The next step is to group or to cluster the similar queries. The idea behind the transformation is to reduce variance among the user queries. Simplifying the query allows us to have more relevant statistics by merging queries that should be considered as the same.

With Data Science Studio, this transformation is very easy. We open the Category clustering window, choose few settings and wait for suggested merge. Then, we choose the clusters to merge.

Text clustering

In our case, it calculated that we have 198,760 distinct values (out of ~300,000 queries). Among the distinct values, it found 1,320 clusters. Here is an overview:

Text clustering

By merging clusters with a set of settings, we can easily reduce the number of distinct values by a few thousand.

Top queries after a two-step clustering:

Top queries after a two-step clustering

That's it! This is a good way to work with text-based datasets. We have other great tools in the studio such as the fuzzy join for joining datasets on an approximate string matching.

Feel free to leave a comment if you want to know more, or request for a demo instance!

Other Content You May Like