What is a transformer model?
A transformer model is a neural network architecture that can automatically transform one type of input into another type of output. The term was coined in a 2017 Google paper that found a way to train a neural network for translating English to French with more accuracy and a quarter of the training time of other neural networks.
The technique proved more generalizable than its authors realized, and transformers have since found use in generating text, images and robot instructions. They can also model relationships between different types of data -- an approach called multimodal AI -- for example, transforming natural language instructions into images or robot instructions.
Virtually all applications that use natural language processing now use transformers under the hood because they perform better than prior approaches. Researchers have also discovered that transformer models can learn to work with chemical structures, predict protein folding and analyze medical data at scale.
One essential aspect of transformers is their use of an AI concept called attention, which weights related words -- or tokens representing other types of data, such as a section of an image, part of a protein structure or a speech phoneme -- to help establish the context for a given word or token.
The notion of attention has been around since the 1990s as a processing technique. In 2017, however, a team of Google researchers suggested that attention could be used to directly encode the meaning of words and the structure of a given language. This was revolutionary because it replaced what previously required an additional encoding step using a dedicated neural network. It also unlocked a way to model virtually any type of information, paving the way for the extraordinary breakthroughs of the last several years.
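To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind transformers. The dimensions and random inputs are illustrative only, not values from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """For each query token, weight every value by how well its key matches."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# 4 tokens, each represented by an 8-dimensional vector (toy data).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` shows how strongly one token attends to every other token, which is the mechanism the article describes for painting a word's context.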
What can a transformer model do?
Transformers are gradually supplanting the previously dominant deep learning architectures in many applications, including recurrent neural networks (RNNs) and convolutional neural networks (CNNs). RNNs were well suited to processing streams of data such as speech, sentences and code, but they could only handle short sequences at a time. Newer RNN variants, such as long short-term memory networks, supported longer sequences but remained limited and slow. In contrast, transformers can process longer sequences, and they can process each word or token in parallel, which lets them scale more efficiently.
CNNs are ideal for processing spatial data, such as analyzing multiple regions of a photo in parallel for similarities in features such as lines, shapes and textures. These networks are optimized for comparing nearby areas. Transformer models, such as the Vision Transformer introduced in 2021, in contrast appear to do a better job of comparing regions that might be far away from each other. Transformers also work better with unlabeled data.
Transformers can learn to efficiently represent the meaning of a text by analyzing large bodies of unlabeled data. This lets researchers scale transformers to hundreds of billions and even trillions of parameters. In practice, the models pre-trained on unlabeled data serve only as a starting point, to be refined further for a specific task with labeled data. This is acceptable because the second step requires far less expertise and processing power.
Transformer model architecture
A transformer architecture consists of an encoder and a decoder that work together. The attention mechanism lets transformers encode the meaning of words based on the estimated importance of other words or tokens. This enables transformers to process all words or tokens in parallel for faster performance, helping drive the growth of increasingly large language models (LLMs).
Thanks to the attention mechanism, the encoder block transforms each word or token into a vector weighted by the influence of the other words. For example, in the following two sentences, the meaning of the word it would be weighted differently owing to the change of the word filled to emptied:
- He poured the pitcher into the cup and filled it.
- He poured the pitcher into the cup and emptied it.
The attention mechanism would connect it to the cup being filled in the first sentence and to the pitcher being emptied in the second sentence.
The decoder essentially reverses the process in the target domain. The original use case was translating English to French, but the same mechanism could translate short English questions and instructions into longer answers. Conversely, it could translate a longer article into a more concise summary.
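The decoder produces its output one token at a time, feeding each prediction back in as input for the next step. The toy sketch below shows that autoregressive loop with greedy selection; the "model" here is a stand-in function, not a real transformer, and all token IDs are made up for illustration:

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len=10):
    """Generate tokens one at a time: score the next token given the
    sequence so far, keep the best one, and stop at the end token."""
    seq = [start_id]
    for _ in range(max_len):
        logits = step_fn(seq)           # scores over the vocabulary
        next_id = int(np.argmax(logits))
        seq.append(next_id)
        if next_id == end_id:
            break
    return seq

# Toy stand-in for a decoder: always prefers the token after the last one.
def toy_step(seq, vocab=6):
    logits = np.zeros(vocab)
    logits[(seq[-1] + 1) % vocab] = 1.0
    return logits

print(greedy_decode(toy_step, start_id=0, end_id=4))  # [0, 1, 2, 3, 4]
```

Real decoders use the same loop but score tokens with the full attention stack, and often sample rather than always taking the single best token.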
Transformer model training
There are two key phases involved in training a transformer. In the first phase, a transformer processes a large body of unlabeled data to learn the structure of the language or a phenomenon, such as protein folding, and how nearby elements seem to affect each other. This is a costly and energy-intensive aspect of the process. It can take millions of dollars to train some of the largest models.
Once the model is trained, it's helpful to fine-tune it for a particular task. A technology company might want to tune a chatbot to respond to different customer service and technical support queries with varying levels of detail depending on the user's knowledge. A law firm might adjust a model for analyzing contracts. A development team might tune the model to its own extensive library of code and unique coding conventions.
The fine-tuning process requires significantly less expertise and processing power. Proponents of transformers argue that the large expense that goes into training larger general-purpose models can pay off because it saves time and money in customizing the model for so many different use cases.
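A common way to keep fine-tuning cheap is to freeze the pre-trained weights and train only a small task-specific head. The sketch below illustrates that division of labor; the "encoder" is just a fixed random projection standing in for a pre-trained model, and the data and labels are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained encoder: a fixed random projection.
W_frozen = rng.normal(size=(16, 8))
def encode(x):
    return np.tanh(x @ W_frozen)   # features never change during fine-tuning

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy labeled task, linearly separable in the frozen feature space.
X = rng.normal(size=(64, 16))
w_true = rng.normal(size=8)
y = (encode(X) @ w_true > 0).astype(float)

# Fine-tuning updates only the small task head, not the encoder.
w_head = np.zeros(8)
lr = 0.5
for _ in range(200):
    p = sigmoid(encode(X) @ w_head)
    grad = encode(X).T @ (p - y) / len(y)   # logistic-loss gradient
    w_head -= lr * grad

acc = ((sigmoid(encode(X) @ w_head) > 0.5) == y).mean()
```

Only the 8 head weights are trained here, which is why this second phase costs so much less than pre-training the encoder itself.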
The number of parameters in a model is sometimes used as a proxy for its performance instead of more salient metrics. However, parameter count -- the size of the model -- doesn't directly correlate with performance or utility. For example, Google recently experimented with a mixture-of-experts technique that trained LLMs about seven times more efficiently than comparable models. Yet even though some of the resulting models exceeded a trillion parameters, they were less precise than models with hundreds of times fewer parameters.
Similarly, Meta reported that its Large Language Model Meta AI (Llama) with 13 billion parameters outperformed a 175-billion-parameter generative pre-trained transformer (GPT) model on major benchmarks. A 65-billion-parameter variant of Llama matched the performance of models with over 500 billion parameters.
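A mixture-of-experts layer keeps compute per token low by routing each input to just one (or a few) of many expert sub-networks, so total parameter count can grow without the per-token cost growing with it. This NumPy sketch of top-1 routing is a simplified illustration, not Google's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, n_experts = 8, 4

# Each "expert" is a small independent weight matrix.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts))   # the router's weights

def moe_layer(x):
    """Top-1 routing: only one expert runs per token, so compute stays
    flat even as the number of experts (total parameters) grows."""
    gate = softmax(x @ W_gate)
    k = int(np.argmax(gate))               # pick the single best expert
    return gate[k] * (x @ experts[k]), k

x = rng.normal(size=d)
out, chosen = moe_layer(x)
```

With four experts, this layer holds four times the parameters of a dense layer of the same width, yet each token only ever touches one expert's weights.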
Transformer model applications
Transformers can be applied to virtually any task that processes a given input type to generate an output. Examples include the following use cases:
- Translating from one language to another.
- Programming more engaging and useful chatbots.
- Summarizing long documents.
- Generating a long document from a brief prompt.
- Generating drug chemical structures based on a particular prompt.
- Generating images from a text prompt.
- Creating captions for an image.
- Creating a robotic process automation (RPA) script from a brief description.
- Providing code completion suggestions based on existing code.
Transformer model implementations
Transformer implementations continue to improve in size and in their support for new use cases and domains, such as medicine, science and business applications. The following are some of the most promising transformer implementations:
- Google's Bidirectional Encoder Representations from Transformers (BERT) was one of the first LLMs based on transformers.
- OpenAI's GPT followed suit and underwent several iterations, including GPT-2, GPT-3, GPT-3.5, GPT-4 and ChatGPT.
- Meta's Llama achieves comparable performance with models 10 times its size.
- Google's Pathways Language Model (PaLM) generalizes and performs tasks across multiple domains, including text, images and robotic controls.
- OpenAI's DALL-E creates images from a short text description.
- The University of Florida and Nvidia's GatorTron analyzes unstructured data from medical records.
- DeepMind's AlphaFold 2 predicts how proteins fold.
- AstraZeneca and Nvidia's MegaMolBART generates new drug candidates based on chemical structure data.