Working on text-based datasets is a different world to dealing with numbers. Comparisons between words are much harder and it may be difficult to group or aggregate similar values. Let's find out more about how to deal with this.
A textual input in your database is probably composed of a sequence of words. If you want to compare two inputs to check how close they are, it is difficult.
- Words generally have different forms or inflections. A noun can be inflected for a number (singular or plural with the "s"), a verb for a tense (past tense with -ed), etc.
- Not all words give the same amount of information. Typically, some short function words that we call "stop words" ("a", "the", etc.) are generally meaningless when making phrase comparisons.
- Misspelled words add a difficulty (read the list of the 100 most often misspelt words in English).
The new Yuzu version of Data Science Studio integrates tools to help you with natural language processing. Today, I present you two tips that I often use.
The AOL dataset
To illustrate this article, I downloaded a publicly available dataset of web queries collected on AOL Search in 2006 (as of 2016, this dataset can no longer be downloaded from infochimps). From the 20M search queries, I extracted 300,000 queries to create the dataset we are going to use.
Each row is a search query issued by an user. The columns are :
- AnonID: an anonymous user ID;
- Query: the query -> what we are going to work on;
- QueryTime: the time at which the query was submitted;
- ItemRan: the rank of the item the user clicked on;
- ClickURL: the domain of the link the user clicked on.
This is a really interesting dataset because the queries come from a real case and with misspelling.
Learn more about the dataset in the README file provided.
Let's have a look at the data. We have around 300,000 rows. We see that a text transformation has already been applied to the dataset: all queries are in lower case.
The top queries are:
The first transformation we are going to do is to simplify the text of the queries. It will help us to compare and regroup similar queries.
Clicking on the "add script" in the interface, we choose the "simplify text" processor. Then we change the settings to transform the 'Query' column into a new column with the following actions:
- Normalizing text: to transform to lowercase, remove accents and perform Unicode normalization.
- Stemming words: to transform each word into its 'stem', ie. its grammatical root. For example, 'grammatical' is transformed to 'grammat'.
- Clearing stop words: to remove so-called 'stop words' (the, I, a, of, etc.).
Top queries are now:
The next step is to group or to cluster the similar queries. The idea behind the transformation is to reduce variance among the user queries. Simplifying the query allows us to have more relevant statistics by merging queries that should be considered as the same.
With Data Science Studio, this transformation is very easy. We open the Category clustering window, choose few settings and wait for suggested merge. Then, we choose the clusters to merge.
In our case, it calculated that we have 198,760 distinct values (out of ~300,000 queries). Among the distinct values, it found 1,320 clusters. Here is an overview:
By merging clusters with a set of settings, we can easily reduce the number of distinct values by a few thousand.
Top queries after a two-step clustering:
That's it! This is a good way to work with text-based datasets. We have other great tools in the studio such as the fuzzy join for joining datasets on an approximate string matching.
Feel free to leave a comment if you want to know more, or request for a demo instance!