This tutorial shows, at a very high level, how Non-negative Matrix Factorization (NMF) applied to a Term Frequency-Inverse Document Frequency (TF-IDF) matrix can be used to extract topics from large collections of text.
In a previous post, I introduced how we can use Markov Chains to generate novelty Clinton and Trump speeches. In this post, however, I want to show you that we can do more serious and useful stuff with text data.
This technique amounts to shrinking a bunch of text documents into a smaller number of archetype documents (or topics) that are described by relevant keywords. In other words, NMF/TF-IDF is pretty much like your SparkNotes friend who can answer the following question: "Hey dude, can you group Shakespeare's work into five main topics and describe each topic using its 10 most relevant words?"
It's not the first time we're posting about NMF/TF-IDF. In fact, in a previous blog post, Leo already used this technique to discover topics in Queen Elizabeth's Christmas Eve speeches.
Yet, because NMF/TF-IDF is such a nice little tool, I thought I would use it once more to analyze something that couldn't be more relevant: speeches from U.S. presidential candidates Hillary Clinton and Donald Trump found on the website www.whatthefolly.com.
Term Frequency–Inverse Document Frequency (TF-IDF) Matrix
Many applications in Natural Language Processing (NLP) go through the crucial step of transforming words into machine-readable numerical vectors. The TF-IDF fulfills this role with an extra feature: it also gives us a measure of how important a word is to a document in the corpus.
Specifically, we build a matrix where each row indexes a document (or speech) and each column indexes a word (or term). There are as many columns as there are unique words throughout the corpus. In this matrix, entry (i, j) is a non-negative number that represents word j’s frequency in document i relative to word j’s frequency in all other documents.
I am not going to give you the details of how this relative frequency is computed. For our purposes, we just need to understand that a word's relative frequency in a document increases the more often the word appears in that document compared with other words in the same document, and the less often it appears in all other documents.
To make this a bit clearer, suppose we have the following four speeches:
- Speech 1 : This is a speech on gun violence at night.
- Speech 2 : This is a speech on gun control.
- Speech 3 : This is a speech on Medicare.
- Speech 4 : This is the rhythm of the night.
The TF-IDF matrix is given right below:
Note that the words “this,” “is,” “the,” “of,” “on,” “a,” and “at” are so-called English stop-words. Those stop-words are words that are very frequent in the English language and do not convey much information. In the majority of text mining applications, those words are removed, and we did the same here.
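For readers who want to see the mechanics, here is a minimal sketch of how such a matrix could be built for the four toy speeches, using only the Python standard library. The post does not say which TF-IDF formula was used, so the weighting below (term frequency times an unsmoothed log inverse document frequency) is an assumption, and the helper names `tokenize` and `tfidf_matrix` are hypothetical. The resulting numbers will therefore not necessarily match the matrix described in this post.

```python
import math
from collections import Counter

# Toy corpus from the text above.
speeches = [
    "This is a speech on gun violence at night.",
    "This is a speech on gun control.",
    "This is a speech on Medicare.",
    "This is the rhythm of the night.",
]

# Only the stop-words named in the text; real applications use a longer list.
STOP_WORDS = {"this", "is", "the", "of", "on", "a", "at"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stop-words."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(w for doc in tokens for w in set(doc))
    matrix = []
    for doc in tokens:
        counts = Counter(doc)
        row = []
        for term in vocab:
            tf = counts[term] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[term])  # assumption: unsmoothed IDF
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


vocab, matrix = tfidf_matrix(speeches)
for row in matrix:
    print([round(x, 2) for x in row])
```

Under this weighting, "violence" scores higher than "speech" in the first speech because it appears in only one document, while "speech" appears in three. Libraries such as scikit-learn ship a `TfidfVectorizer` that does the same job with smoothing and normalization options.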
Non-negative Matrix Factorization (NMF)
Now that we have our (non-negative) TF-IDF matrix, NMF will allow us to approximate this matrix as the product of two non-negative matrices of chosen dimension. Specifically, we approximate our TF-IDF matrix as the product of a document-topic matrix and a topic-term matrix.
Going back to our working example, suppose we want to group four speeches into two topic groups (the number of groups is arbitrary: we could have chosen one, three, or even four topic groups depending on how "broad" we want the topics to be).
The speech-topic matrix (left pane) reveals how important each speech is to each topic, whereas the topic-word matrix shows the importance of each word to each topic. It's this latter matrix that we use to summarize our speeches.
In our example, speeches one through three have been grouped into one topic with the words “speech” and “gun” being the most important features. Speeches one and four also seem important to the second topic because they share the word "night." In short, NMF has summarized four speeches into two groups that appear to be talking about “guns” and “nocturnal rhythms,” respectively.
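To make the factorization step concrete, here is a rough sketch of NMF in plain Python using the classic multiplicative update rules for the squared-error objective. The post does not say which NMF algorithm or library was used, so treat this as one possible approach and not the method behind the results above. To keep the example self-contained, the input `V` holds raw word counts for the four toy speeches (after stop-word removal) instead of TF-IDF weights, and the names `nmf`, `matmul`, `transpose`, and `frobenius_error` are hypothetical helpers.

```python
import random

# Rows: speeches 1-4. Columns: control, gun, medicare, night, rhythm, speech, violence.
# Assumption: raw counts stand in for the TF-IDF weights to keep the sketch short.
V = [
    [0, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0],
]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Square root of the summed squared differences between v and w @ h."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate v (docs x terms) as w (docs x topics) times h (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Random positive starting point; multiplicative updates keep entries non-negative.
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


W, H = nmf(V, n_topics=2)
print("speech-topic matrix:", [[round(x, 2) for x in row] for row in W])
print("topic-word matrix:  ", [[round(x, 2) for x in row] for row in H])
print("reconstruction error:", round(frobenius_error(V, W, H), 3))
```

The factorization is not unique: the order of the topics and the exact weights depend on the random starting point, so a different `seed` may give a different grouping. For real corpora, a tested implementation such as the `NMF` class in scikit-learn is a more practical choice than this loop.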
Summarizing Clinton and Trump Speeches
It’s time to apply our NMF technique to Trump and Clinton speeches. With about 60 Trump speeches and 90 Clinton speeches, we can look at a larger number of topics (say 10). Below are the 10 most important words for each topic, taken from the topic-word matrix.
10 Clinton Topics
Topic/Word | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 |
---|---|---|---|---|---|---|---|---|---|---|
Topic 1 | woman | fight | President | know | America | parenthood | republican | just | family | right |
Topic 2 | Iran | Syria | agreement | think | get | Israel | Assad | Putin | sanction | Isis |
Topic 3 | health | care | insure | state | afford | cost | want | work | act | |
Topic 4 | gun | immune | vote | Sanders | consensus | people | NRA | violence | gunmaker | coalition |
Topic 5 | committee | testify | Palestinian | Savannah | talk | state | people | transparent | know | |
Topic 6 | Isis | terrorists | Muslim | attack | American | radical | need | intelligence | Jihadist | secure |
Topic 7 | want | go | people | make | college | pay | debt | work | family | tax |
Topic 8 | need | policy | justice | criminal | man | officer | overdose | epidemic | heroin | opioid |
Topic 9 | America | thank | country | right | campaign | work | know | job | make | people |
Topic 10 | Libya | Gaddafi | Arab | lot | offer | send | try | think | help | moderate |
10 Trump Topics
Topic/Word | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 |
---|---|---|---|---|---|---|---|---|---|---|
Topic 1 | know | go | say | people | think | like | want | thing | look | money |
Topic 2 | policy | America | Clinton | Hillary | country | foreign | President | radical | Islam | job |
Topic 3 | go | people | thank | want | great | love | win | know | just | really |
Topic 4 | great | Atlanta | city | country | bankrupt | overlap | deal | make | Chris | file |
Topic 5 | Israel | deal | Iran | Palestinian | state | united | missile | President | thank | terror |
Topic 6 | English | Spanish | assimilate | school | speak | voucher | kid | program | question | country |
Topic 7 | scholar | baby | vaccine | autism | period | year | legal | child | congress | think |
Topic 8 | tax | reduct | like | pay | thing | discuss | make | forget | tonight | major |
Topic 9 | Lucent | Carly | company | catastrophe | tenure | Sonnenfeld | Jeffrey | Compaq | Haven | CEO |
Topic 10 | people | Assad | think | response | leave | pay | impact | need | social | policy |
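Word lists like the ones in the tables above can be read off a topic-word matrix by sorting each row by weight and keeping the highest-scoring terms. The snippet below is a small sketch of that step. The function name `top_terms` is hypothetical, and the weights in `topic_term` are made-up placeholder numbers on the toy vocabulary, not values from the Clinton or Trump analysis.

```python
def top_terms(topic_term, vocab, n_top=10):
    """For each topic (row), return the n_top terms with the largest weights."""
    result = []
    for weights in topic_term:
        # Column indices sorted from largest to smallest weight.
        ranked = sorted(range(len(vocab)), key=lambda j: weights[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n_top]])
    return result


vocab = ["control", "gun", "medicare", "night", "rhythm", "speech", "violence"]

# Placeholder weights for illustration only (2 topics x 7 terms).
topic_term = [
    [0.3, 0.9, 0.2, 0.1, 0.0, 1.0, 0.4],
    [0.0, 0.1, 0.0, 0.9, 0.7, 0.0, 0.3],
]

top_words = top_terms(topic_term, vocab, n_top=2)
print(top_words)  # [['speech', 'gun'], ['night', 'rhythm']]
```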
I'll let you be the judge of the method I presented today. In my opinion, NMF is a pretty nifty tool for gaining preliminary insight into a large number of documents, but it falls short on nuance. Without prior knowledge of who Donald Trump is and what he does, it would be quite difficult to know his take on vaccines from looking at Topic 7 alone. Is he himself an anti-vaccine proponent or is he criticizing the anti-vaccine movement?