Large language models are the dynamite behind the generative AI boom of 2023. However, they've been around for a while.
LLMs are black box AI systems that use deep learning on extremely large datasets to understand and generate new text. Modern LLMs began taking shape in 2014 when the attention mechanism -- a machine learning technique designed to mimic human cognitive attention -- was introduced in a research paper titled "Neural Machine Translation by Jointly Learning to Align and Translate." In 2017, that attention mechanism was honed with the introduction of the transformer model in another paper, "Attention Is All You Need."
Some of the most well-known language models today are based on the transformer model, including the generative pre-trained transformer series of LLMs and bidirectional encoder representations from transformers (BERT).
ChatGPT, which runs on a set of language models from OpenAI, attracted more than 100 million users just two months after its release in 2022. Since then, many competing models have been released. Some belong to big companies such as Google and Microsoft; others are open source.
Constant developments in the field can be difficult to keep track of. Here are some of the most influential models, both past and present. Included in it are models that paved the way for today's leaders as well as those that could have a significant effect in the future.
This article is part of
Top current LLMs
Below are some of the most relevant large language models today. They do natural language processing and influence the architecture of future models.
BERT is a family of LLMs that Google introduced in 2018. BERT is a transformer-based model that can convert sequences of data to other sequences of data. BERT's architecture is a stack of transformer encoders and features 342 million parameters. BERT was pre-trained on a large corpus of data then fine-tuned to perform specific tasks along with natural language inference and sentence text similarity. It was used to improve query understanding in the 2019 iteration of Google search.
The Claude LLM focuses on constitutional AI, which shapes AI outputs guided by a set of principles that help the AI assistant it powers helpful, harmless and accurate. Claude was created by the company Anthropic and powers its two main product offerings: Claude Instant and Claude 2. Claude 2 excels at complex reasoning according to Anthropic.
Cohere is an enterprise LLM that can be custom-trained and fine-tuned to a specific company’s use case. The company that created the Cohere LLM was founded by one of the authors of Attention Is All You Need. One of Cohere’s strengths is that it is not tied to one single cloud -- unlike OpenAI, which is bound to Microsoft Azure.
Falcon 40B is a transformer-based, causal decoder-only model developed by the Technology Innovation Institute. It is open source and was trained on English data. The model is available in two smaller variants as well: Falcon 1B and Falcon 7B (1 billion and 7 billion parameters). Amazon has made Falcon 40B available on Amazon SageMaker. It is also available for free on GitHub.
Just three days into its public release in November 2022, Galactica was Meta's LLM designed specifically for scientists. It was trained on a collection of academic material -- 48 million papers, lecture notes, textbooks and websites. As most models do, it produced AI "hallucinations" that members of the scientific community deemed unsafe because they sounded authoritative. This made them hard to detect quickly and were generated in a domain that requires little margin for error.
GPT-3 is OpenAI's large language model with more than 175 billion parameters, released in 2020. GPT-3 uses a decoder-only transformer architecture. In September 2022, Microsoft announced it had exclusive use of GPT-3's underlying model. GPT-3 is 10 times larger than its predecessor. GPT-3's training data includes Common Crawl, WebText2, Books1, Books2 and Wikipedia.
GPT-3 is the last of the GPT series of models in which OpenAI made the parameter counts publicly available. The GPT series was first introduced in 2018 with OpenAI's paper "Improving Language Understanding by Generative Pre-Training."
GPT-3.5 is an upgraded version of GPT-3 with fewer parameters. GPT-3.5 was fine-tuned using reinforcement learning from human feedback. GPT-3.5 is the version of GPT that powers ChatGPT. There are several models, with GPT-3.5 turbo being the most capable, according to OpenAI. GPT-3.5's training data extends to September 2021.
It was also integrated into the Bing search engine but has since been replaced with GPT-4.
GPT-4 is the largest model in OpenAI's GPT series, released in 2023. Like the others, it's a transformer-based model. Unlike the others, its parameter count has not been released to the public, though there are rumors that the model has more than 170 trillion. OpenAI describes GPT-4 as a multimodal model, meaning it can process and generate both language and images as opposed to being limited to only language. GPT-4 also introduced a system message, which lets users specify tone of voice and task.
GPT-4 demonstrated human-level performance in multiple academic exams. At the model's release, some speculated that GPT-4 came close to artificial general intelligence (AGI), which means it is as smart or smarter than a human. GPT-4 powers Microsoft Bing search, is available in ChatGPT Plus and will eventually be integrated into Microsoft Office products.
Lamda (Language Model for Dialogue Applications) is a family of LLMs developed by Google Brain announced in 2021. Lamda used a decoder-only transformer language model and was pre-trained on a large corpus of text. In 2022, LaMDA gained widespread attention when then-Google engineer Blake Lemoine went public with claims that the program was sentient. It was built on the Seq2Seq architecture.
Large Language Model Meta AI (Llama) is Meta's LLM released in 2023. The largest version is 65 billion parameters in size. Llama was originally released to approved researchers and developers but is now open source. Llama comes in smaller sizes that require less computing power to use, test and experiment with.
Llama uses a transformer architecture and was trained on a variety of public data sources, including webpages from CommonCrawl, GitHub, Wikipedia and Project Gutenberg. Llama was effectively leaked and spawned many descendants, including Vicuna and Orca.
Orca was developed by Microsoft and has 13 billion parameters, meaning it's small enough to run on a laptop. It aims to improve on advancements made by other open source models by imitating the reasoning procedures achieved by LLMs. Orca achieves the same performance as GPT-4 with significantly fewer parameters and is on par with GPT-3.5 for many tasks. Orca is built on top of the 13 billion parameter version of LLaMA.
The Pathways Language Model is a 540 billion parameter transformer-based model from Google powering its AI chatbot Bard. It was trained across multiple TPU 4 Pods -- Google's custom hardware for machine learning. Palm specializes in reasoning tasks such as coding, math, classification and question answering. Palm also excels at decomposing complex tasks into simpler subtasks.
PaLM gets its name from a Google research initiative to build Pathways, ultimately creating a single model that serves as a foundation for multiple use cases. There are several fine-tuned versions of Palm, including Med-Palm 2 for life sciences and medical information as well as Sec-Palm for cybersecurity deployments to speed up threat analysis.
Phi-1 is a transformer-based language model from Microsoft. At just 1.3 billion parameters, Phi-1 was trained for four days on a collection of textbook-quality data. Phi-1 is an example of a trend toward smaller models trained on better quality data and synthetic data.
"We'll probably see a lot more creative scaling down work: prioritizing data quality and diversity over quantity, a lot more synthetic data generation, and small but highly capable expert models," wrote Andrej Karpathy, former director of AI at Tesla and OpenAI employee, in a tweet.
Phi-1 specializes in Python coding and has fewer general capabilities because of its smaller size.
StableLM is a series of open source language models developed by Stability AI, the company behind image generator Stable Diffusion. There are 3 billion and 7 billion parameter models available and 15 billion, 30 billion, 65 billion and 175 billion parameter models in progress at time of writing. StableLM aims to be transparent, accessible and supportive.
Vicuna is another influential open source LLM derived from Llama. It was developed by LMSYS and was fine-tuned using data from sharegpt.com. It is smaller and less capable that GPT-4 according to several benchmarks, but does well for a model of its size. Vicuna has only 33 billion parameters, whereas GPT-4 has trillions.
Although LLMs are a recent phenomenon, their precursors go back decades. Learn how recent precursor Seq2Seq and distant precursor ELIZA set the stage for modern LLMs.
Seq2Seq is a deep learning approach used for machine translation, image captioning and natural language processing. It was developed by Google and underlies some of their modern LLMs, including LaMDA. Seq2Seq also underlies AlexaTM 20B, Amazon's large language model. It uses a mix of encoders and decoders.
Eliza was an early natural language processing program created in 1966. It is one of the earliest examples of a language model. Eliza simulated conversation using pattern matching and substitution. Eliza, running a certain script, could parody the interaction between a patient and therapist by applying weights to certain keywords and responding to the user accordingly. The creator of Eliza, Joshua Weizenbaum, wrote a book on the limits of computation and artificial intelligence.