Exploring GPT-3 architecture

GPT-3 is one of the largest and best-known neural networks available for natural language applications. With 175 billion parameters, it is far larger than earlier models of its kind.

OpenAI's GPT-3 architecture represents a seminal shift in AI research and use. One of the largest neural networks developed to date, it promises significant improvements in natural language tools and applications.

Developers can use the deep learning-powered language model to build just about anything related to language. The approach holds promise for startups developing advanced natural language processing (NLP) tools, not only for B2C applications but also for enterprise B2B use cases.

Generative Pre-trained Transformer 3 (GPT-3) is "arguably the biggest and best general-purpose NLP AI model out there," said Vishwastam Shukla, CTO at HackerEarth.

Because the model is so generic, users can go wild with how they want to use GPT-3, including creating mobile apps, building search engines, translating languages and writing poetry. The model includes over 10 times as many parameters as the next largest NLP model, Microsoft's Turing-NLG, so GPT-3's accuracy and the value it can deliver are significantly higher.

GPT-3 parameters

"The AI industry is excited about GPT-3 because of the sheer flexibility that 175 billion weighted connections between parameters bring to NLP application development," said Dattaraj Rao, chief data scientist at Persistent Systems.

OpenAI, the artificial intelligence research lab that created GPT-3, trained the model on over 45 terabytes of data from the internet and books to support its 175 billion parameters.

"Parameters in machine language parlance depict skills or knowledge of the model, so the higher the number of parameters, the more skillful the model is," Shukla said.

Parameters are like variables in an equation, explained Sri Megha Vujjini, a data scientist at Saggezza, a global IT consultancy.

In a basic mathematical equation such as "a+5b=y", "a" and "b" are parameters, and "y" is the result. In a machine learning algorithm, these parameters correspond to the weighting between words, such as the correlation between their meaning or use together.
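To make the analogy concrete, here is a minimal Python sketch. It assumes a model of the form y = w1*x1 + w2*x2 + bias, in which the weights and the bias play the role of parameters. The `TinyLinearModel` class and its values are hypothetical and chosen only for illustration; a model like GPT-3 has billions of such numbers rather than three.

```python
from dataclasses import dataclass, fields


@dataclass
class TinyLinearModel:
    """Hypothetical two-input linear model: y = w1*x1 + w2*x2 + bias."""

    w1: float
    w2: float
    bias: float

    def predict(self, x1: float, x2: float) -> float:
        # The inputs change from example to example; the parameters stay fixed
        # once training has set them.
        return self.w1 * x1 + self.w2 * x2 + self.bias

    def num_parameters(self) -> int:
        # Every learned number stored in the model counts as one parameter.
        return len(fields(self))


# Illustrative values only, echoing the "a + 5b = y" equation above.
model = TinyLinearModel(w1=1.0, w2=5.0, bias=0.0)
y = model.predict(2.0, 3.0)  # 1*2 + 5*3 + 0
print(y, model.num_parameters())
```

Training a model means adjusting `w1`, `w2` and `bias` until the predictions match the data; the parameter count is simply how many such adjustable numbers exist.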

The next largest model is Microsoft's Turing-NLG, with about 17 billion parameters. GPT-2, OpenAI's predecessor to GPT-3, had only 1.5 billion.

Earlier this year, EleutherAI, a collective of volunteer AI researchers, engineers and developers, released GPT-Neo 1.3B and GPT-Neo 2.7B.

Named for the number of parameters they have, the GPT-Neo models feature an architecture very similar to OpenAI's GPT-2.
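A rough rule of thumb shows where such parameter counts come from. The sketch below assumes a GPT-style decoder-only transformer in which each layer holds about 12 x d_model^2 weights (attention plus a feed-forward block with 4x expansion), and it ignores biases, layer norms and position embeddings. The layer counts, model widths and vocabulary size are assumptions taken from the publicly released model configurations, not from this article, and the function name is hypothetical.

```python
def estimate_decoder_parameters(n_layers: int, d_model: int, vocab_size: int = 50257) -> int:
    """Approximate parameter count for a GPT-style decoder-only transformer.

    Per layer: ~4*d_model^2 attention weights + ~8*d_model^2 feed-forward
    weights (assuming a 4x hidden expansion). Small terms are ignored.
    """
    per_layer = 12 * d_model * d_model
    token_embeddings = vocab_size * d_model
    return n_layers * per_layer + token_embeddings


# (n_layers, d_model) pairs: assumed values from the published configurations.
CONFIGS = {
    "GPT-Neo 1.3B": (24, 2048),
    "GPT-Neo 2.7B": (32, 2560),
    "GPT-3 175B": (96, 12288),
}

estimates = {name: estimate_decoder_parameters(*cfg) for name, cfg in CONFIGS.items()}
for name, count in estimates.items():
    print(f"{name}: ~{count / 1e9:.2f} billion parameters")
```

Under these assumptions, the estimates land close to the figures in the model names, which suggests that depth and width account for nearly all of the parameters.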

Rao said GPT-Neo delivers performance comparable to GPT-2 and the smaller GPT-3 models. Most importantly, developers can download it and fine-tune it on domain-specific text for their own purposes. As a result, Rao expects lots of new applications to come out of GPT-Neo.

Meanwhile, researchers are pushing toward even bigger models. Google's Switch Transformer, a sparsely activated model that uses only part of its network for each input, already has 1.6 trillion parameters.

Encoding language skills

Sreekar Krishna, national leader AI and head of data engineering at KPMG US, said, "GPT-3 fundamentally represents the next step in the evolution of a natural learning system."

It demonstrates that a system can learn aspects of domain knowledge and language constructs using millions of examples.

Traditional algorithmic development broke problems into fundamental core micro-problems, which could be individually addressed toward the final solution. Humans solve problems in the same way, but we are aided by decades of training in common sense, general knowledge and business experience.

In the traditional machine learning training process, algorithms are shown a sample of training data and are expected to learn various capabilities to match human decision-making.

For decades, scientists have hypothesized that if algorithms were fed tremendous volumes of data, they would assimilate domain-specific data along with general knowledge, grammar constructs and human social norms. However, it was hard to test this theory owing to limited computing power and the challenges of systematically testing highly complex systems.

Yet, the success of the GPT-3 architecture has demonstrated that researchers are on the right track, Krishna said. With enough data and the right architecture, it is possible to encode general knowledge, grammar and even humor into the network.

GPT-3 language models

Ingesting such huge amounts of data from diverse sources created a sort of general-purpose tool in GPT-3.

"We don't need to tune it for different use cases," Vujjini said.

For example, the accuracy of a traditional model for translating English to German will vary based on how well it was trained and how the data is ingested. But with the GPT-3 architecture, the output seems accurate regardless of how the data is ingested. More significantly, a developer doesn't have to train it with translation examples specifically.

This makes it easy to extend GPT-3 to a broad range of use cases and language tasks.
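In practice, steering a general-purpose model toward a task such as translation usually means placing a few worked examples in the prompt rather than retraining the model. The following is a minimal sketch of that idea; the `build_few_shot_prompt` helper and the "English:/German:" layout are hypothetical choices for illustration, not a format prescribed by OpenAI, and the code only builds the text without calling any API.

```python
from typing import List, Tuple


def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a few-shot translation prompt as plain text.

    The model would be expected to continue the text after the final
    "German:" label, imitating the pattern set by the examples.
    """
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"German: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("German:")
    return "\n".join(lines)


# Illustrative example pairs only.
prompt = build_few_shot_prompt(
    task="Translate English to German.",
    examples=[
        ("Good morning.", "Guten Morgen."),
        ("Thank you very much.", "Vielen Dank."),
    ],
    query="Where is the train station?",
)
print(prompt)
```

The same helper could hold examples for summarization or classification instead; only the text in the prompt changes, not the model.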

"Developers can be more productive by training the GPT-3 model with a few examples, and it will develop an application in any language such as Python, JavaScript or Rust," said Terri Sage, CTO of 1010data.

Sage has also experimented with using it to help companies analyze customer feedback by interpreting language patterns to develop insight.

However, Rao argues that some domain-specific training is required to tune GPT-3 language models to get the most value in real-world applications such as healthcare, banking and coding.

For example, training a GPT-type model on a data set of patient diagnoses by doctors based on symptoms could make it easier to recommend diagnoses. Meanwhile, GitHub, which Microsoft owns, offers a code auto-completer called Copilot that can automatically generate lines of source code. It is built on OpenAI Codex, a descendant of GPT-3 fine-tuned on large volumes of source code.

GPT-3 vs. BERT

GPT-3 is often compared with Google's BERT language model since both are large neural networks for NLP built on transformer architectures. But there are substantial differences in terms of size, development methods, and delivery models.

Due to a strategic partnership between Microsoft and OpenAI, GPT-3 is offered only as a private service, while BERT is available as open source software.

GPT-3 performs better out of the box in new application domains than BERT, said Krishna. This means enterprises can tackle simple business problems more quickly than with BERT.

But GPT-3 can become unwieldy due to the sheer infrastructure businesses need to deploy and use it, said Shukla. Enterprises can comfortably load the largest BERT model, at 345 million parameters, on a single GPU workstation.

At 175 billion parameters, the largest GPT-3 model is roughly 500 times the size of the largest BERT model. That size comes at a much higher computational cost, which is why GPT-3 is offered only as a service, while BERT can be embedded into new applications.
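A back-of-the-envelope calculation illustrates the gap. The sketch below estimates only the memory needed to hold the weights, using the parameter counts quoted above. The precision choices (32-bit floats for BERT, 16-bit floats for GPT-3) are assumptions, and real deployments also need memory for activations and other overhead, so these are lower bounds.

```python
# Bytes used to store one parameter at each numeric precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(num_parameters: int, precision: str = "fp16") -> float:
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_parameters * BYTES_PER_PARAM[precision] / 1e9


BERT_LARGE_PARAMS = 345_000_000   # figure quoted in the article
GPT3_PARAMS = 175_000_000_000     # figure quoted in the article

bert_gb = weight_memory_gb(BERT_LARGE_PARAMS, "fp32")
gpt3_gb = weight_memory_gb(GPT3_PARAMS, "fp16")
size_ratio = GPT3_PARAMS / BERT_LARGE_PARAMS

print(f"BERT (fp32 weights):  {bert_gb:.2f} GB")
print(f"GPT-3 (fp16 weights): {gpt3_gb:.0f} GB")
print(f"Parameter ratio:      {size_ratio:.0f}x")
```

Under these assumptions, BERT's weights fit easily in the memory of a single workstation GPU, while GPT-3's weights alone would have to be spread across many accelerators.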

Both BERT and GPT-3 are built on the transformer architecture, which in its original form pairs an encoder with a decoder. The encoder creates a contextual embedding for a sequence of data, while the decoder uses this embedding to generate a new sequence.

BERT uses only the encoder side of the architecture, which makes it well suited to generating contextual embeddings from a sequence. This is useful for sentiment analysis or question answering. GPT-3, meanwhile, uses only the decoder side, which takes in context and generates new text one token at a time. This is useful for writing content, creating summaries or generating code.
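One way to see the difference is in the attention mask each style of model uses. The sketch below assumes the standard formulation: in an encoder-style model such as BERT, every token can attend to every other token, while in a decoder-style model such as GPT-3, each token can attend only to itself and the tokens before it. The function names are hypothetical, and a 1 means "may attend" while a 0 means "blocked."

```python
from typing import List


def bidirectional_mask(n_tokens: int) -> List[List[int]]:
    """Encoder-style mask: every position sees the whole sequence."""
    return [[1] * n_tokens for _ in range(n_tokens)]


def causal_mask(n_tokens: int) -> List[List[int]]:
    """Decoder-style mask: position i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n_tokens)] for i in range(n_tokens)]


tokens = ["The", "model", "writes", "text"]
encoder_view = bidirectional_mask(len(tokens))
decoder_view = causal_mask(len(tokens))

for token, enc_row, dec_row in zip(tokens, encoder_view, decoder_view):
    print(f"{token:>7}  encoder: {enc_row}  decoder: {dec_row}")
```

Seeing the full sequence helps with understanding tasks, where the whole input is known up front. The left-to-right restriction is what lets a decoder generate text, since each new token is predicted from only what came before it.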

Sage said GPT-3 supports significantly more use cases than BERT. GPT-3 is suitable for writing articles, reviewing legal documents, generating resumes, gaining business insights from consumer feedback and building applications. BERT is used more for voice assistance, analysis of customer reviews and some enhanced searches.
