AI is the sexiest trend in tech, and NLP (Natural Language Processing) is one of its many branches. NLP is the ability of a machine to understand and “speak” (generate) human language.
In an age of information overload, the ability of computers to understand the form and meaning of data is especially important.
Types of NLP & Uses of Each
NLP sounds like a very niche thing, but it’s actually incredibly prevalent. You’ve probably encountered a natural language processing system in your day-to-day life without realizing it.
Some common subfields of NLP are:
- Question answering (search engines)
- Speech recognition (Siri, Alexa)
- Machine translation - translating from one language to another (Google Translate)
- Information extraction - pulling relevant details from unstructured and/or structured data (like important info from health records, relevant news that could impact a trade for a trading algorithm, etc.)
- Sentiment analysis - detecting the attitude (positive, negative, neutral) of a piece of text (used by businesses on their social media comments or for customer service, etc.)
How NLP Works
Because there are so many nuances and inconsistencies in human language, NLP is notoriously difficult. Figuring out how to capture those nuances, and the context around them, in the strict and constrained language of computers is no easy task.
So how do machine whisperers do it?
Pre-processing data: The data must be cleaned and annotated (labeled) so that it can be processed by an algorithm. Cleaning usually involves deconstructing the data into words or chunks of words (tokenization), removing words without much standalone meaning (stop words such as a, the, an), making the data more uniform (like changing all words to lowercase), and grouping words into pre-defined categories such as the names of persons (entity extraction). All of this can be done using the spaCy library in Python. Annotation boils down to examining surrounding words and using language rules or statistics to tag parts of speech (similar to how we would use context clues to guess the meaning of a word).
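The cleaning steps above (tokenization, lowercasing, stop-word removal) can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list here is a made-up, tiny sample, and in practice a library like spaCy handles all of this far more robustly, including annotation and entity extraction.

```python
import re

# A hypothetical, minimal stop-word list for illustration;
# real NLP libraries such as spaCy ship much more complete ones.
STOP_WORDS = {"a", "an", "the", "is", "of", "on"}

def preprocess(text):
    """Tokenize, lowercase, and strip stop words -- the cleaning
    steps described above (no annotation or entity extraction)."""
    # Tokenization + lowercasing: split on word characters
    tokens = re.findall(r"[a-z']+", text.lower())
    # Stop-word removal
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat sat on a mat"))  # ['cat', 'sat', 'mat']
```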
Vectorization (or “embedding”): After preprocessing, the non-numerical data is transformed into numerical data, since it’s much faster for a computer to operate on vectors. Older methods like BOW (Bag of Words), which simply count occurrences of specific words (adding n-gram features extends this to sequences of n words), fail to capture the context of text, so deep learning methods such as word2vec are becoming widespread. Rather than treating words in isolation, word2vec learns a vector for each word by predicting its surrounding words. This means the learned vectors can be reused across tasks and contexts, rather than needing to rebuild a model for each specific dataset.
Testing: Once a baseline has been created (the “rough draft” NLP model), its prediction accuracy is tested using cross validation, a model validation technique that divides data into training and testing subsets. The model is built using the training subset and then tested on the testing subset to see if the model is generalizable; we don’t want a model that only gives accurate predictions for one specific dataset!
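The splitting behind k-fold cross validation can be sketched in plain Python (a minimal illustration of the idea; in practice, libraries like scikit-learn provide this with shuffling and stratification). Each fold takes a turn as the test set while the remaining folds form the training set, so every sample is tested on exactly once.

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal folds and return
    (train, test) index pairs -- each fold serves once as the test set."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # Spread any remainder across the first folds
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```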
Future of NLP
Natural language processing is a quickly growing field (just look at assistants like Alexa on Amazon’s Echo devices); a Tractica report predicts that NLP software solutions taking advantage of AI will see market growth from $136 million in 2016 to $5.4 billion by 2025.
With the rise of deep learning solutions such as skip-thought vectors, NLP’s current capabilities in the healthcare, customer service, finance, and automotive industries will expand, and NLP will become an even more pertinent part of our lives.
To learn more about NLP, including technical details on how to implement it, check out the detailed NLP breakdown on Data From the Trenches, which will run a whole series of in-depth, technical how-tos on the subject over the coming months.