This tutorial will show how (at a very high level) Non-negative Matrix Factorization (NMF) applied to a matrix of Term Frequency-Inverse Document Frequency (TF-IDF) can be used to extract topics from large collections of text.

In a previous post, I introduced how we can use Markov Chains to generate novelty Clinton and Trump speeches. In this post, however, I want to show you that we can do more serious and useful stuff with text data.

This technique amounts to shrinking a bunch of text documents into a smaller number of archetype documents (or topics) that are described by relevant key words. In other words, NMF/TF-IDF is pretty much like your SparkNotes friend who can answer the following question: "hey dude, can you group Shakespeare's work into five main topics and describe each topic using its 10 most relevant words?"

It's not the first time we're posting about NMF/TF-IDF. In fact, in a previous blog post, Leo already used this technique to discover topics in Queen Elizabeth's Christmas Eve speeches.

Yet, because NMF/TF-IDF is such a nice little tool, I thought I would use it once more to analyze something that couldn't be more relevant: speeches from U.S. presidential candidates Hillary Clinton and Donald Trump found on the website www.whatthefolly.com.

## Matrix of Term Frequency-Inverse Document Frequency (TF-IDF)

Many applications in Natural Language Processing (NLP) go through the crucial step of transforming words into machine-readable numerical vectors. The TF-IDF fulfills this role with an extra feature: it also gives us a measure of how important a word is to a document in the corpus.

Specifically, we build a matrix where each row indexes a document (or speech) and each column indexes a word (or term). There are as many columns as there are unique words throughout the corpus. In this matrix, entry i, j is a non-negative number that represents word j’s frequency in document i relative to word j’s frequency in all other documents.

I am not going to give you the details of how this relative frequency is computed. For our purposes, we just need to understand that a word's relative frequency in a document is higher the more often the word appears in that document (compared to other words in the same document) and the less often it appears in all other documents.

To make this a bit more clear, suppose we have the following 4 speeches:

• Speech 1 : This is a speech on gun violence at night.
• Speech 2 : This is a speech on gun control.
• Speech 3 : This is a speech on Medicare.
• Speech 4 : This is the rhythm of the night.

The TF-IDF matrix is given right below:

Note that the words “this,” “is,” “the,” “of,” “on,” “a,” and “at” are so-called English stop-words. Those stop-words are words that are very frequent in the English language and do not convey much information. In the majority of text mining applications, those words are removed, and we did the same here.
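For readers who want to see the numbers, here is a minimal sketch that builds a TF-IDF matrix for the four toy speeches using only the Python standard library. It assumes one common textbook weighting (term frequency within the speech times ln(N / number of speeches containing the word)), so the values will not necessarily match those produced by any particular library. The `tokenize` helper and the `STOP_WORDS` set are illustrative names, and the stop-word set contains only the seven words listed above.

```python
import math
from collections import Counter

speeches = [
    "This is a speech on gun violence at night.",
    "This is a speech on gun control.",
    "This is a speech on Medicare.",
    "This is the rhythm of the night.",
]

# Only the stop-words named in the text; real stop-word lists are much longer.
STOP_WORDS = {"this", "is", "the", "of", "on", "a", "at"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop-words."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


docs = [tokenize(s) for s in speeches]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# Document frequency: the number of speeches that contain each word.
doc_freq = {w: sum(w in doc for doc in docs) for w in vocab}


def tfidf_row(doc):
    """One row of the matrix: a weight for every vocabulary word."""
    counts = Counter(doc)
    return [
        (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w])
        for w in vocab
    ]


tfidf = [tfidf_row(doc) for doc in docs]

print(vocab)
for i, row in enumerate(tfidf, start=1):
    print(f"Speech {i}:", [round(x, 2) for x in row])
```

Under this weighting, a word that appears in only one speech (such as "violence") gets a higher weight in that speech than a word shared by several speeches (such as "speech"), which is the behavior described above.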

## Non-negative Matrix Factorization (NMF)

Now that we have our (non-negative) TF-IDF matrix, NMF will allow us to approximate this matrix as the product of 2 non-negative matrices of chosen dimension. Specifically, we approximate our TF-IDF matrix as the product of a document-topic matrix and a topic-term matrix.
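In symbols, and with notation that I am introducing here purely for illustration: write $V$ for the TF-IDF matrix with $n$ documents and $m$ terms, $W$ for the document-topic matrix, $H$ for the topic-term matrix, and $k$ for the chosen number of topics. The factorization then reads:

$$
V \approx W H, \qquad V \in \mathbb{R}_{\ge 0}^{n \times m}, \quad W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

One common way to make "approximate" precise (an assumption on my part, since other loss functions are also used) is to minimize the squared Frobenius norm of the reconstruction error:

$$
\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^2
$$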

Going back to our working example, suppose we want to group 4 speeches into 2 topic groups (the number of groups is arbitrary: we could have chosen 1, 3 or even 4 topic groups depending on how "broad" we want the topics to be).

The speech-topic matrix (left pane) reveals how important each speech is to each topic, whereas the topic-word matrix shows the importance of each word to each topic. It's this latter matrix that we use to summarize our speeches.

In our example, speeches 1 through 3 have been grouped into one topic with the words “speech” and “guns” being the most important features. Speeches 1 and 4 also seem important to the second topic because they share the word "night." In short, NMF has summarized four speeches into two groups that appear to be talking about “guns” and “nocturnal rhythms,” respectively.
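If you would like to try the factorization yourself, here is a minimal sketch for the four toy speeches. It assumes NumPy, a simple tf × ln(N / df) weighting, and the classic multiplicative update rules for the squared-error objective. These are my illustrative choices rather than the exact setup behind the results described above, so the topics you get may differ. The `nmf` function and its parameters are hypothetical names.

```python
import numpy as np

vocab = ["control", "gun", "medicare", "night", "rhythm", "speech", "violence"]

# Term counts after stop-word removal
# (rows = speeches, columns = words in `vocab` order).
counts = np.array([
    [0, 1, 0, 1, 0, 1, 1],  # Speech 1: speech, gun, violence, night
    [1, 1, 0, 0, 0, 1, 0],  # Speech 2: speech, gun, control
    [0, 0, 1, 0, 0, 1, 0],  # Speech 3: speech, medicare
    [0, 0, 0, 1, 1, 0, 0],  # Speech 4: rhythm, night
], dtype=float)

# Assumed weighting: tf * ln(N / df). Other TF-IDF variants work too.
tf = counts / counts.sum(axis=1, keepdims=True)
idf = np.log(len(counts) / (counts > 0).sum(axis=0))
V = tf * idf


def nmf(V, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Approximate V by W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))  # speech-topic matrix
    H = rng.random((n_topics, V.shape[1]))  # topic-word matrix
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative;
        # eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


W, H = nmf(V, n_topics=2)
print("reconstruction error:", np.linalg.norm(V - W @ H))
for k, row in enumerate(H, start=1):
    top = [vocab[j] for j in np.argsort(row)[::-1][:3]]
    print(f"Topic {k}:", top)
```

Because the starting matrices are random, a different `seed` can swap the order of the topics or change the ranking of words with similar weights.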

## Summarizing Clinton and Trump Speeches

It’s time to apply our NMF techniques to Trump and Clinton speeches. With about 60 Trump speeches and 90 Clinton speeches, we can look at a larger number of topics (say 10). You can find below the 10 most important words of the topic-word matrix per topic.
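For a corpus of this size you would normally lean on a library. The sketch below assumes scikit-learn, which is my choice for illustration and not necessarily what produced the tables in this post. The function name `top_words_per_topic` is hypothetical, and the four toy speeches stand in for the real transcripts. The tables also appear to contain lemmatized words, a preprocessing step this sketch skips.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def top_words_per_topic(documents, n_topics=10, n_words=10):
    """Return the n_words highest-weighted terms for each NMF topic."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)  # documents x terms
    model = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    model.fit(tfidf)
    # Invert the term -> column mapping so column j maps back to its word.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    topics = []
    for weights in model.components_:  # one row per topic
        best = weights.argsort()[::-1][:n_words]
        topics.append([terms[j] for j in best])
    return topics


# Stand-in corpus: the four toy speeches from earlier in the post.
toy_speeches = [
    "This is a speech on gun violence at night.",
    "This is a speech on gun control.",
    "This is a speech on Medicare.",
    "This is the rhythm of the night.",
]

toy_topics = top_words_per_topic(toy_speeches, n_topics=2, n_words=3)
for k, words in enumerate(toy_topics, start=1):
    print(f"Topic {k}:", words)
```

With the real transcripts loaded as a list of strings, calling the function with its defaults (`n_topics=10`, `n_words=10`) would give ten topics of ten words each, the same shape as the tables below.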

### 10 Clinton Topics

| Topic | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic 1 | woman | fight | President | know | America | parenthood | republican | just | family | right |
| Topic 2 | Iran | Syria | agreement | think | get | Israel | Assad | Putin | sanction | Isis |
| Topic 3 | health | care | insure | state | afford | cost | want | work | act | pocket |
| Topic 4 | gun | immune | vote | Sanders | consensus | people | NRA | violence | gunmaker | coalition |
| Topic 5 | email | comittee | testify | Palestinian | Savannah | talk | state | people | transparent | know |
| Topic 6 | Isis | terrorists | Muslim | attack | American | radical | need | intelligence | Jihadist | secure |
| Topic 7 | want | go | people | make | college | pay | debt | work | family | tax |
| Topic 8 | need | policy | justice | criminal | man | officer | overdose | epidemic | heroin | opioid |
| Topic 9 | America | thank | country | right | campaign | work | know | job | make | people |
| Topic 10 | Libya | Gaddafi | Arab | lot | offer | send | try | think | help | moderate |

### 10 Trump Topics

| Topic | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic 1 | know | go | say | people | think | like | want | thing | look | money |
| Topic 2 | policy | America | Clinton | Hillary | country | foreign | President | radical | Islam | job |
| Topic 3 | go | people | thank | want | great | love | win | know | just | really |
| Topic 4 | great | Atlanta | city | country | bankrupt | overlap | deal | make | Chris | file |
| Topic 5 | Israel | deal | Iran | Palestinian | state | united | missile | President | thank | terror |
| Topic 6 | English | Spanish | assimilate | school | speak | voucher | kid | program | question | country |
| Topic 7 | scholar | baby | vaccine | autism | period | year | legal | child | congress | think |
| Topic 8 | tax | reduct | like | pay | thing | discuss | make | foreget | tonight | major |
| Topic 9 | Lucent | Carly | company | catastrophe | tenure | Sonnenfeld | Jeffrey | Compaq | Haven | CEO |
| Topic 10 | people | Assad | think | response | leave | pay | impact | need | social | policy |

I'll let you be the judge of the method I presented today. In my opinion, NMF is a pretty nifty tool for gaining preliminary insight into a large number of documents, but it falls short on nuance.

Without prior knowledge of who Donald Trump is and what he does, it would be quite difficult to know his take on vaccines from looking at Topic 7 alone. Is he himself an anti-vaccine proponent or is he criticizing the anti-vaccine movement?

Feel free to try Dataiku, and send me an email if you have any questions, comments or suggestions!