Easy Text Clustering + NLP with Dataiku

Dataiku Product Jeremy Greze

Working on text-based datasets is a different world than dealing with numbers. Comparisons between words are much harder, and it may be difficult to group or aggregate similar values. The new Yuzu version of Dataiku integrates tools to help you with natural language processing (NLP). This blog post presents two tips that I often use when working with text-based datasets.

*Note: This blog post shows screen shots from an older version of Dataiku. To see the latest version in action, check out the demo.

A textual input in your database is probably composed of a sequence of words. If you want to compare two inputs to check how close they are, it is difficult because:

  • Words generally have different forms or inflections. A noun can be inflected for a number (singular or plural with the "s"), a verb for a tense (past tense with -ed), etc.
  • Not all words give the same amount of information. Typically, some short function words that we call "stop words" ("a," "the," etc.) are generally meaningless when making phrase comparisons.
  • Misspelled words add a layer of difficulty (and even though there are lists of the 100 most often misspelled words in English, misspellings can occur anywhere).

The AOL Dataset

To provide an example for this article, I downloaded a publicly available dataset of web queries collected on AOL Search in 2006 (note that as of 2016, this dataset can no longer be downloaded from infochimps, but you can learn more about the dataset in the README file). From the 20M search queries, I extracted 300,000 queries to create the dataset we are going to use.

Each row is a search query issued by a user. The columns are:

  • AnonID: an anonymous user ID
  • Query: the query -> what we are going to work on
  • QueryTime: the time at which the query was submitted
  • ItemRan: the rank of the item the user clicked on
  • ClickURL: the domain of the link the user clicked on

This is a really interesting dataset because the queries come from a real case and with misspellings. Let's have a look at the data: we have around 300,000 rows. We see that a text transformation has already been applied to the dataset: all queries are in lower case.

The AOL dataset presented in Dataiku DSS v1.1

The top queries are:

Top queries for the AOL dataset

Text Normalization

The first transformation we are going to do is to simplify the text of the queries, which will help us compare and regroup similar queries. By clicking on "add script" in Dataiku,  we can choose the "simplify text" processor. Then we can change the settings to transform the Query column into a new column with the following actions:

  • Normalizing text: Transform to lowercase, remove accents, and perform Unicode normalization.
  • Stemming words: Transform each word into its "stem," i.e., its grammatical root. For example, grammatical would be transformed to "grammat."
  • Clearing stop words: Remove so-called stop words (the, I, a, of, etc.).

gif of Text normalization in Dataiku DSS v1.1

Top queries are now:

Top Queries in AOL dataset after text normalization

Text Clustering

The next step is to group or to cluster the similar queries. The idea behind this transformation is to reduce variance among the user queries. Simplifying the query allows us to have more relevant statistics by merging queries that should be considered as the same queries.

With Dataiku, this transformation is very easy. We open the Category clustering window, choose a few settings, and wait for suggested merge. Then, we choose the clusters to merge:

gif of Text clustering in Dataiku DSS v1.1

In our case, it calculated that we have 198,760 distinct values (out of ~300,000 queries). Among the distinct values, it found 1,320 clusters. Here is an overview:

overview of query analysis after Text clustering

By merging clusters with a set of settings, we can easily reduce the number of distinct values by a few thousand.

Top queries after a two-step clustering:

Top queries after a two-step clustering

That's it! This is a good way to work with text-based datasets. We have other great tools in Dataiku such as the fuzzy join for joining datasets on an approximate string matching.

You May Also Like

AI Isn't Taking Over, It's Augmenting Decision-Making

Read More

Maximize GenAI Impact in 2025 With Strategy and Spend Tips

Read More

Looking Ahead: AI Hurdles IT Leaders Need to Overcome in 2025

Read More

No-Code ML and GenAI With Dataiku and Fabric

Read More