Arguably more famous today than Michael Bay’s Transformers (at least among the deep learning community), NLP transformer architecture and transformer-based models have been breaking all kinds of performance records across an array of NLP tasks.
This novel architecture has set the stage for a number of state-of-the-art models in the field of natural language processing, such as Google AI’s BERT (and its variants), or Open AI’s GPT-2. The latter, which is an advanced conversational AI model, was judged so powerful that up until recently Open AI had only publicly released a “weakened” version of it, out of fear for it being misused to disseminate misinformation.
But in order to understand the hype around Transformer models and their real-world implications, it’s worth taking a step back and looking into the architecture and inner workings behind these models. In this blog post, we’ll walk you through the rise of the transformer architecture, starting by its key component -- the Attention paradigm.
Attention Mechanisms - the Origin Story: Machine Translation
The Attention paradigm made its grand entrance into the NLP landscape back in 2014, before the deep learning hype, and was first applied to the problem of machine translation.
Typically, a machine translation system follows a basic encoder-decoder architecture (as shown in the image below), where both the encoder and decoder are generally variants of recurrent neural networks (RNNs). To understand how a RNN works, it helps to imagine it as a succession of cells. The encoder RNN receives an input sentence and reads it one token at a time: each cell receives an input word and produces a hidden state as output, which is then fed as input to the next RNN cell, until all the words in the sentence are processed.
After this, the last-generated hidden state will hopefully capture the gist of all the information contained in every word of the input sentence. This vector, called the context vector, will then be fed as input to the decoder RNN, which will produce the translated sentence one word at a time.
But is it safe to reasonably assume that the context vector can retain ALL the needed information of the input sentence? What about if the sentence is, say, 50 words long? Because of the inherent sequential structure of RNNs, each input cell only produces one output hidden state vector for each word in the sentence, one by one. Due to the sequential order of word processing, it’s harder for the context vector to capture all the information contained in a sentence for long sentences with complicated dependencies between words -- this is referred to as “the bottleneck problem”.
Solving the Bottleneck Problem With Attention
To address this bottleneck issue, researchers created a technique for paying attention to specific words. When translating a sentence or transcribing an audio recording, a human agent would pay special attention to the word they are presently translating or transcribing.
Neural networks can achieve this same behavior using Attention, focusing on part of a subset of the information they are given. Remember that each input RNN cell produces one hidden state vector for each input word. We can then concatenate these vectors, average them, or (even better!) weight them as to give higher importance to words — from the input sentence — that are most relevant to decode the next word (of the output sentence). This is what the Attention technique is all about.
If you want to get even more into the nitty gritty of the Attention mechanism’s inner workings, we recommend that you read this Data from the Trenches blog post.
As you now understand, Attention was a revolutionary idea in sequence-to-sequence systems such as translation models. NLP Transformer models are based on the Attention mechanism, taking its core idea even further: in addition to using Attention to compute representations (i.e., context vectors) out of the encoder’s hidden state vectors, why not use Attention to compute the encoder’s hidden state vectors themselves? The immediate advantage of this is getting rid of the inherent sequential structure of RNNs, which hinders the parallelization of models.
To solve the problem of parallelization, Attention boosts the speed of how fast the model can translate from one sequence to another. Thus, the main advantage of Transformer models is that they are not sequential, which means that unlike RNNs, they can be more easily parallelized, and that bigger and bigger models can be trained by parallelizing the training.
What’s more, Transformer models have so far displayed better performance and speed than RNN models. Due to all these factors, a lot of the NLP research in the past couple of years has been focused on Transformer models, and we can expect this to translate into exciting new business use cases as well.