The release of ChatGPT by OpenAI in December 2022 has drawn an incredible amount of attention. This curiosity extends from artificial intelligence in general to the class of technologies that underpins the AI chatbot in particular. These models, called large language models (LLMs), are capable of generating text on a seemingly endless range of topics. Understanding LLMs is key to understanding how ChatGPT works.
What makes LLMs impressive is their ability to generate human-like text in almost any language, including programming languages. These models are a genuine innovation; nothing quite like them has existed before.
This article will explain what these models are, how they are developed, and how they work, at least to the extent that we understand how they work. As it turns out, our understanding of why they work so well is, spookily, only partial.
A Large Language Model Is a Type of Neural Network
A neural network is a type of machine learning model based on a number of small mathematical functions called neurons. Like the neurons in a human brain, they are the lowest level of computation.
Each neuron is a simple mathematical function that calculates an output based on some input. The power of the neural network, however, comes from the connections between the neurons.
Each neuron is connected to some of its peers, and the strength of each connection is quantified through a numerical weight. These weights determine the degree to which the output of one neuron will be taken into account as an input to a following neuron.
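To make this concrete, here is a minimal sketch of a single artificial neuron: a weighted sum of its inputs passed through an activation function. The function name, the sample numbers, and the choice of a sigmoid activation are all illustrative, not any particular library's API.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of its inputs,
    passed through a nonlinear activation (here, a sigmoid)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

# The weights decide how much each incoming signal counts.
output = neuron(inputs=[0.5, -1.0, 0.25], weights=[0.8, 0.2, -0.5], bias=0.1)
print(round(output, 3))
```

A full network is nothing more than many of these functions wired together, with the output of one neuron feeding the inputs of others.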
A neural network can be very small. For example, a basic one could have six neurons with a total of eight connections between them. A neural network can also be very large, as is the case with LLMs, which may have millions of neurons with hundreds of billions of connections between them, each connection having its own weight.
An LLM Uses a Transformer Architecture
We already know that an LLM is a type of neural network. More specifically, LLMs use a particular neural network architecture called a transformer, which is designed to process and generate data in sequence, like text.
An architecture in this context describes how the neurons are connected to one another. All neural networks group their neurons into a number of different layers. If there are many layers, the network is described as being “deep,” which is where the term “deep learning” comes from.
In a very simple neural network architecture, each neuron may be connected to every neuron in the layer above it. In others, a neuron may only be connected to some other neurons that are near it in a grid.
The latter is the case in what are called Convolutional Neural Networks (CNNs). CNNs have formed the foundation of modern image recognition over the past decade. The fact that a CNN is structured in a grid (like the pixels in an image) is no coincidence; in fact, it is an important reason why that architecture works well for image data.
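The difference between the two connectivity patterns can be sketched in a few lines of Python. The function names and the neighborhood width below are invented for illustration; the point is simply that full connectivity produces many more weights than local, grid-like connectivity.

```python
def dense_connections(n_lower, n_upper):
    """Fully connected: every neuron in one layer links to every
    neuron in the next, so the number of weights grows multiplicatively."""
    return [(i, j) for i in range(n_lower) for j in range(n_upper)]

def local_connections(n_lower, n_upper, width=3):
    """CNN-style local connectivity: each upper neuron links only to
    a small neighborhood of nearby lower neurons."""
    return [(i, j) for j in range(n_upper)
            for i in range(n_lower) if abs(i - j) < width]

print(len(dense_connections(6, 6)))  # 36 weights
print(len(local_connections(6, 6)))  # 24 weights: only nearby pairs
```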
A transformer, however, is somewhat different. Developed in 2017 by researchers at Google, a transformer is based on the idea of “attention,” whereby certain neurons are more strongly connected to (or “pay more attention to”) other neurons in a sequence.
Text is read in and read out in sequence, one word after another, with different parts of a sentence referring to or modifying others (such as an adjective that modifies the noun but not the verb). It is therefore no coincidence that an architecture built to work on sequences, with varying strengths of connection between different parts of that sequence, should work well on text-based data.
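The core of the attention idea can be sketched as follows, assuming toy two-dimensional vectors standing in for tokens (real models use vectors with thousands of learned dimensions). This is the standard scaled dot-product formulation, stripped to its essentials.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention: how strongly one token's query
    vector 'attends to' each other token's key vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns raw scores into attention weights that sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vectors for three tokens in a sentence; a query aligns most
# strongly with the key it is most similar to.
weights = attention_weights(query=[1.0, 0.0],
                            keys=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(weights)  # highest weight on the first (most similar) key
```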
An LLM Builds Itself
In its simplest terms, a model is a computer program: a set of instructions that performs calculations on input data and produces an output.
What is particular about a machine learning or AI model, however, is that rather than writing those instructions explicitly, the human programmers instead write a set of instructions (an algorithm) that then reviews large volumes of existing data to define the model itself. As such, the human programmers do not build the model, they build the algorithm that builds the model.
In the case of an LLM, this means that the programmers define the architecture for the model and the rules by which it will be built. But they do not create the neurons or the weights between the neurons. That is done in a process called “training” during which the model, following the instructions of the algorithm, defines those variables itself.
In the case of an LLM, the data that is reviewed is text. In some cases, that text may be specialized; in others, it is more generic. For the largest models, the objective is to provide the model with as much well-formed text as possible to learn from.
Over the process of training, which may consume many millions of dollars worth of cloud computing resources, the model reviews this text and attempts to produce text of its own. Initially, the output is gibberish, but through a massive process of trial and error, in which the model continually compares its predictions against the actual text in its training data, the quality of the output gradually improves. The text becomes more intelligible.
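The learning signal behind that trial and error, predicting the next word, can be illustrated with a toy stand-in that merely counts which word follows which in a tiny corpus. Real LLMs instead adjust billions of weights by gradient descent, but the objective is the same.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Toy stand-in for training: learn from text which word tends
    to follow which. Real LLMs learn this statistically through
    gradient descent over billions of weights."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the next word seen most often during training."""
    return counts[word].most_common(1)[0][0]

model = train_bigram("the cat sat on the mat because the cat was tired")
print(predict_next(model, "the"))  # 'cat' follows 'the' most often
```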
Given enough time, enough computing resources, and enough training data, the model “learns” to produce text that, to the human reader, is indistinguishable from text written by a human. In some cases, human reviewers provide feedback, telling the model when its text reads well and when it doesn’t; this feedback is used to train a sort of reward model that further refines the LLM, a process called “reinforcement learning from human feedback,” or RLHF. The model takes this feedback into account and continuously improves.
An LLM Predicts Which Word Should Follow the Previous
A reductive description of LLMs that has emerged is that they “simply predict the next word in a sequence.” This is true, but it ignores the fact that this simple process can allow tools like ChatGPT to generate remarkably high-quality text. One could just as easily say that “the model is simply doing math,” which is also true, but not very useful in helping us understand how the model works or in appreciating its power.
The result of the training process described above is a neural network with hundreds of billions of connections among millions of neurons, each defined by the model itself. The largest models represent a large volume of data; simply storing all of the weights may take several hundred gigabytes.
Each neuron is a mathematical function, parameterized by its weights, that must be calculated for each word (or, in some cases, part of a word) that is provided to the model as input, and for each word (or part of a word) that it generates as output.
It’s a technical detail, but these words and parts of words are called “tokens,” and token counts are often how the use of these models is priced when they are provided as a service (more on that later).
The user interacting with one of these models provides an input in the form of text. For example, we can provide the following prompt to ChatGPT:
Hello ChatGPT, please provide me with a 100-word description of Dataiku.
Include a description of its software and its core value proposition.
The models behind ChatGPT then break that prompt into tokens. On average, a token is about ⅘ of a word, so the above prompt of 23 words might result in about 30 tokens. The GPT-3 model on which gpt-3.5-turbo is based has 175 billion weights, while the GPT-4 model, also available in ChatGPT, has an undisclosed number of weights.
Then, the model sets about generating a response that sounds right, based on the immense volume of text it consumed during training. Importantly, it is not looking anything up. It has no memory in which it can search for “dataiku,” “value proposition,” “software,” or any other relevant terms. Instead, for each token of output text, it performs the computation again, generating the token that has the highest probability of sounding right.
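This token-by-token loop can be sketched with a stand-in for the model’s forward pass. The lookup table below is obviously fake (a real LLM computes each next token from billions of weights); it only illustrates the control flow of feeding each generated token back in as context.

```python
def next_token(context):
    """Stand-in for the model's forward pass: given the text so far,
    return the most probable next token. A real LLM computes this
    with billions of weights; here we fake it with a lookup table."""
    continuations = {
        "Dataiku is": "a",
        "Dataiku is a": "platform",
        "Dataiku is a platform": "for",
        "Dataiku is a platform for": "AI.",
    }
    return continuations.get(context, "")

text = "Dataiku is"
# Generate one token at a time, appending each choice to the context.
while (token := next_token(text)):
    text += " " + token
print(text)
```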
LLMs Produce Text That Sounds Right but Cannot Guarantee That It Is Right
ChatGPT can provide no guarantee that its output is right, only that it sounds right. Its responses are not looked up in its memory — they are generated on the fly based on those 175 billion weights described earlier.
This is not a shortcoming specific to ChatGPT but of the current state of all LLMs. Their skill is not in recalling facts — the simplest databases do that perfectly well. Their strength is, instead, in generating text that reads like human-written text and that, well, sounds right. In many cases, the text that sounds right will also actually be right, but not always.
An LLM’s knowledge is limited to the information present in its training dataset, which may be incomplete, erroneous, or simply outdated. For example, the training dataset of ChatGPT ends in September 2021, so ChatGPT is only aware of facts known before this cutoff date. Moreover, most if not all of the facts in this training dataset are publicly available on the internet, which means that ChatGPT cannot answer questions pertaining to private information. This is a concern because the most interesting business use cases require taking into account non-public documents, such as meeting minutes, legal documents, product specifications, technical documentation, R&D reports, business procedures, and so on.
One way to mitigate this drawback is, whenever a user asks a question, to retrieve relevant facts from an up-to-date knowledge base and to send both the question and the facts to the LLM. Dataiku has developed a demo that does just that, using GPT-3 to answer questions from Dataiku's documentation. More generally, language models can be augmented with tools that give them access to external knowledge sources, extending their reasoning capabilities and enabling them to act in the real world. Such augmented LLMs are usually referred to as “agents.” In this context, tools are simply programs that take some text as input and provide or summarize their results as text. Examples of such tools include:
- A search tool that, given a topic, returns a summary of the corresponding Wikipedia page
- A weather tool that, given a location, returns temperature, precipitation, and wind forecasts
- An email tool that, given an email address and content, sends the message and returns confirmation that the email was sent
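The retrieval approach described above (fetch relevant facts from a knowledge base, then send both the facts and the question to the LLM) can be sketched in a few lines. The keyword-overlap scoring and the sample documents below are purely illustrative; production systems typically rank documents with vector embeddings instead.

```python
def retrieve(question, knowledge_base, top_k=2):
    """Naive retrieval: rank documents by word overlap with the
    question. Real systems use embedding similarity instead."""
    q_words = set(question.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: len(q_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(question, knowledge_base):
    """Combine retrieved facts and the question into one prompt, so
    the LLM's answer is grounded in up-to-date or private documents."""
    facts = "\n".join(retrieve(question, knowledge_base))
    return f"Context:\n{facts}\n\nQuestion: {question}\nAnswer:"

kb = ["The product launch is planned for Q3.",
      "Meeting minutes: the budget was approved on May 2.",
      "The cafeteria menu changes weekly."]
print(build_prompt("When is the product launch planned?", kb))
```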
Under the hood, tools can be very simple programs, with just a few lines of Python code, but they can also be quite involved and rely on external APIs or machine learning (ML) models (or even an LLM augmented with other tools!). Be sure to check out our guidebook “Introduction to Large Language Models With Dataiku” for more details on such tools.
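As a minimal sketch of such a “few lines of Python” tool, here is a hypothetical weather tool and a toy dispatcher. The function names, the canned forecast data, and the “weather:” prefix convention are all invented for illustration; a real tool would call an external API and a real agent would let the LLM itself decide when to invoke it.

```python
def weather_tool(city: str) -> str:
    """A tool is just a program with text in, text out. A real one
    would call a weather API; this one returns canned data."""
    forecasts = {"Paris": "18°C, light rain, wind 15 km/h"}
    return forecasts.get(city, f"No forecast available for {city}.")

def run_agent(request: str) -> str:
    """Minimal dispatch: recognize that a tool is needed, call it,
    and return the tool's text as part of the answer."""
    if request.startswith("weather:"):
        return weather_tool(request.removeprefix("weather:").strip())
    return "I can only look up the weather."

print(run_agent("weather: Paris"))
```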
Is GPT-4 an LLM?
On March 14, 2023, OpenAI released GPT-4, the latest version of its models in the GPT family. In addition to generating higher-quality text than GPT-3.5, GPT-4 introduces the ability to interpret images, although that functionality is not yet publicly available. The ability to handle input data of different types (text and images) makes GPT-4 multimodal. So GPT-4 is still an LLM, but one that is no longer limited to language alone.