AI chatbots show an impressive ability to generate clear and coherent text from simple natural-language prompts. But what's going on behind the scenes?
In the following excerpt from How AI Works: From Sorcery to Science, a recent release from No Starch Press, author and programmer Ronald Kneusel breaks down the components of large language models (LLMs), which power popular AI chatbots such as OpenAI's ChatGPT and Google Bard. Kneusel explains how LLMs use transformer neural networks -- a type of AI architecture introduced in 2017 -- to process input text, enabling them to identify complex relationships and patterns in massive data sets.
Check out the rest of How AI Works for a deep dive into the history and inner workings of AI that doesn't involve extensive math or programming. For more from Kneusel, read his interview with TechTarget Editorial, where he discusses the generative AI boom, including LLMs' benefits and limitations and the importance of alignment.
Large language models are impressive and powerful. So how do they work? Let's take a shot at an answer.
I'll begin at the end, with a few comments from the conclusion of the "Sparks of Artificial General Intelligence" paper mentioned earlier:
How does [GPT-4] reason, plan, and create? Why does it exhibit such general and flexible intelligence when it is at its core merely the combination of simple algorithmic components -- gradient descent and large-scale transformers with extremely large amounts of data? These questions are part of the mystery and fascination of LLMs, which challenge our understanding of learning and cognition, fuel our curiosity, and motivate deeper research.
That quote contains questions that currently lack convincing answers. Simply put, researchers don't know why large language models like GPT-4 do what they do. There are certainly hypotheses in search of evidence and proof, but as I write this, no proven theories are available. Therefore, we can discuss only the what, as in what a large language model entails, and not the how of its behavior.
Large language models use a new class of neural network, the transformer, so we'll begin there. (GPT stands for generative pretrained transformer.) The transformer architecture appeared in the literature in 2017, with the influential paper "Attention Is All You Need" by Google researchers Ashish Vaswani et al. The paper had been cited over 70,000 times as of March 2023.
Traditionally, models that process sequences (such as sentences) used recurrent neural networks, which pass their output back in as input along with the next input of the sequence. This is the logical model for processing text because the network can incorporate the notion of memory via the output fed back in with the next token. Indeed, early deep learning translation systems used recurrent networks. However, recurrent networks have small memories and are challenging to train, which limits their applicability.
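The feedback loop of a recurrent network can be sketched in a few lines of NumPy. The dimensions and random weights below are toy values invented for illustration, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4-dimensional token vectors, 3-dimensional hidden state.
W_in = rng.normal(size=(3, 4))   # input-to-hidden weights
W_h = rng.normal(size=(3, 3))    # hidden-to-hidden (feedback) weights

def rnn_step(hidden, token_vec):
    # The previous hidden state is fed back in along with the next token;
    # this feedback is how a recurrent network carries a (limited) memory
    # forward through the sequence.
    return np.tanh(W_in @ token_vec + W_h @ hidden)

hidden = np.zeros(3)
for token_vec in rng.normal(size=(5, 4)):  # a sequence of 5 toy token vectors
    hidden = rnn_step(hidden, token_vec)

print(hidden.shape)  # (3,)
```

The single hidden vector is the network's entire memory of the sequence, which is why that memory is small and fades as the sequence grows.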
Transformer networks take a different approach: they accept the entire input at once and process it in parallel. Transformer networks typically include an encoder and a decoder. The encoder learns representations and associations between the parts of the input (think sentences), while the decoder uses the learned associations to produce output (think more sentences).
Large language models like GPT dispense with the encoder and instead learn the necessary representation in an unsupervised way using an enormous text dataset. After pretraining, the decoder part of the transformer model generates text in response to the input prompt.
The input to a model like GPT-4 is a sequence of text. The model splits this text into units called tokens: a token might be a word, a part of a word, or even an individual character. Pretraining maps each token to a vector in a multidimensional embedding space, where each vector can be thought of as a point in that space.
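Tokenization and the embedding lookup can be sketched as follows. The three-word vocabulary and the random 8-dimensional vectors are invented for illustration; real models use vocabularies of tens of thousands of sub-word tokens and learn their embedding vectors during pretraining:

```python
import numpy as np

# Hypothetical tiny vocabulary mapping each token to an integer ID.
vocab = {"roses": 0, "are": 1, "red": 2}

rng = np.random.default_rng(1)
embedding = rng.normal(size=(len(vocab), 8))  # one 8-dimensional vector per token

tokens = [vocab[w] for w in "roses are red".split()]  # text -> token IDs
vectors = embedding[tokens]  # each token becomes a point in embedding space
print(vectors.shape)  # (3, 8)
```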
The learned mapping from tokens to vectors captures complex relationships between the tokens so that tokens with similar meanings are nearer to each other than tokens with dissimilar meanings. For example, as shown in Figure 7-3, after pretraining, the mapping (context encoding) will place "dog" closer to "fox" than to "can opener." The embedding space has many dimensions, not the mere two of Figure 7-3, but the effect is the same.
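Nearness in the embedding space is commonly measured with cosine similarity. The 2D vectors below are made-up stand-ins echoing the layout of Figure 7-3, not learned embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 or below
    # mean the tokens are unrelated in the embedding space.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 2D embeddings; a trained model would learn such vectors,
# in far more than two dimensions.
dog = np.array([0.9, 0.8])
fox = np.array([0.8, 0.9])
can_opener = np.array([-0.7, 0.2])

print(cosine_similarity(dog, fox) > cosine_similarity(dog, can_opener))  # True
```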
The context encoding is learned during pretraining by forcing the model to predict the next token given all previous tokens in an input. In effect, if the input is "roses are red," then during the pretraining process the model will be asked to predict the next token after "roses are." If the predicted token isn't "red," the model will use the loss function and backpropagation to update its weights, thereby taking a gradient descent step after suitable averaging of the error over a minibatch. For all their abilities, large language models are trained the same way as other neural networks.
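The predict-compare-update cycle can be sketched with a toy softmax output layer. Everything here is an illustrative stand-in: the context vector playing the role of an encoded "roses are" is random, the vocabulary is tiny, and a real model would average its gradients over a minibatch rather than repeat one example:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim = 5, 4
W = rng.normal(size=(vocab_size, dim)) * 0.1  # toy output-layer weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, context_vec, target, lr=0.5):
    # Forward pass: predict a probability for every token in the vocabulary.
    probs = softmax(W @ context_vec)
    loss = -np.log(probs[target])  # cross-entropy on the true next token
    # Backward pass: gradient of the loss with respect to W,
    # followed by one gradient descent step.
    grad = np.outer(probs - np.eye(len(W))[target], context_vec)
    return W - lr * grad, loss

context = rng.normal(size=dim)  # stand-in for "roses are" encoded as a vector
losses = []
for _ in range(20):
    W, loss = train_step(W, context, target=3)  # token 3 plays the role of "red"
    losses.append(loss)

print(losses[-1] < losses[0])  # True: the loss falls as the weights adapt
```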
Pretraining enables the model to learn language, including grammar and syntax, and seemingly to acquire enough knowledge about the world to allow the emergent abilities that have turned the world of AI on its head.
The decoder step takes the input prompt and produces output token after output token until a special stop token is generated. Because so much of language and the way the world works was learned during pretraining, the decoder step has the side effect of producing extraordinary output even though the decoder is, in the end, just predicting most likely token after most likely token.
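Token-by-token generation until a stop token can be illustrated with a hand-built next-token probability table. A real model computes these probabilities with a transformer, not a fixed lookup table, and typically samples rather than always taking the single most likely token:

```python
import numpy as np

vocab = ["<stop>", "roses", "are", "red"]
# Hypothetical table: row = current token, columns = probability of each
# possible next token.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # after <stop>: stay stopped
    [0.0, 0.0, 0.9, 0.1],   # after "roses": most likely "are"
    [0.1, 0.0, 0.0, 0.9],   # after "are": most likely "red"
    [0.9, 0.1, 0.0, 0.0],   # after "red": most likely <stop>
])

tokens = [1]  # prompt: "roses"
while tokens[-1] != 0 and len(tokens) < 10:
    tokens.append(int(np.argmax(P[tokens[-1]])))  # pick the most likely next token

print(" ".join(vocab[t] for t in tokens))  # roses are red <stop>
```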
More specifically, during the prediction process, GPT-style models use attention to assign importance to the different tokens in the input sequence, thereby capturing relationships between them. This is the primary difference between a transformer model and older recurrent neural networks. The transformer can pay attention to different parts of the input sequence, enabling it to identify and use the relationships between tokens even if they are far apart within the input.
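The core computation, the scaled dot-product attention introduced in "Attention Is All You Need," fits in a few lines. The query, key, and value matrices below are random stand-ins for what a trained model would produce from the input tokens:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Each token's query is scored against every token's key; the scores,
    # after softmax, weight that token's view of all the values -- so a
    # token can "attend" to any other token, however far away it is.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(4)
n_tokens, dim = 6, 8
Q = rng.normal(size=(n_tokens, dim))
K = rng.normal(size=(n_tokens, dim))
V = rng.normal(size=(n_tokens, dim))

out, weights = attention(Q, K, V)
print(out.shape)  # (6, 8); each row of weights sums to 1
```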
When used in chat mode, LLMs give the illusion of a back-and-forth discussion when, in reality, each new prompt from the user is passed to the model along with all the previous text (the user's prompts and the model's replies). Transformer models have a fixed input width (context window), which is currently around 4,000 tokens for GPT-3.5 and some 32,000 for GPT-4. The large input window makes it possible for the attention portion of the model to go back to things that appeared far back in the input, which is something recurrent models cannot do.
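That illusion of conversation can be sketched by resending the accumulated history on every turn, trimmed to the context window. The tiny 16-token window and the echo "model" below are placeholders, not the real GPT-3.5 limits or API:

```python
# A minimal sketch: "chat" is just one growing prompt resent each turn.
CONTEXT_WINDOW = 16  # tokens; tiny, for demonstration only

history: list[str] = []

def chat(model, user_prompt):
    history.append("User: " + user_prompt)
    # Everything said so far is re-sent, trimmed to the context window,
    # so the model only "remembers" what still fits.
    tokens = " ".join(history).split()[-CONTEXT_WINDOW:]
    reply = model(" ".join(tokens))
    history.append("Model: " + reply)
    return reply

# A stand-in model that just reports how many tokens it was given.
echo = lambda prompt: f"(saw {len(prompt.split())} tokens)"
print(chat(echo, "roses are red"))     # (saw 4 tokens)
print(chat(echo, "what did I say?"))   # (saw 13 tokens)
```

The second call sees more tokens because the first exchange rides along with the new prompt; once the history outgrows the window, the oldest tokens fall off and are forgotten.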
A large language model can be used as-is after pretraining, but many applications first fine-tune it on domain-specific data. For generic models like GPT-4, fine-tuning likely included a step known as reinforcement learning from human feedback (RLHF). In RLHF, the model is trained further using feedback from real human beings to align its responses with human values and societal expectations.
This is necessary because LLMs are not conscious entities, and thus they cannot understand human society and its many rules. For example, unaligned LLMs will respond with step-by-step instructions for many activities that human society restricts, like how to make drugs or bombs. The "Sparks" paper contains several such examples of GPT-4 output before the RLHF step that aligned the model with societal expectations.
Stanford University's open source Alpaca model is based on LLaMA, a large language model from Meta. As of this writing, Alpaca has not undergone an alignment process and will answer questions that GPT and other commercial LLMs correctly refuse to answer.
Conclusion: Alignment is absolutely critical to ensure that powerful language models conform to human values and societal norms.