Generative adversarial networks (GANs) hold considerable promise for generating media, such as images and voices, as well as drug molecules. They were also one of the most popular generative AI techniques until transformers were introduced a few years ago.
Transformers are a foundational technology underpinning many advances in large language models, such as generative pre-trained transformers (GPTs). They're now expanding into multimodal AI applications capable of correlating content as diverse as text, images, audio and robot instructions more efficiently than techniques like GANs.
Let's explore the beginnings of each technique, their use cases and how researchers are now combining the two into various transformer-GAN hybrids.
GAN architecture explained
GANs were introduced in 2014 by Ian Goodfellow and colleagues to generate realistic-looking handwritten digits and faces. They combine the following two neural networks:
- A generator, which in image models is typically a deconvolutional (transposed convolution) neural network that creates content from random noise, optionally conditioned on a text or image prompt.
- A discriminator, typically a convolutional neural network (CNN) that classifies images as authentic or counterfeit.
Before GANs, computer vision was mainly done with CNNs that captured lower-level features of an image, like edges and color, and higher-level features representing entire objects, said Adrian Zidaritz, founder of the Institute for a Stronger Democracy through Artificial Intelligence. The novelty of the GAN architecture resulted from its adversarial approach in which one neural network proposes generated images, while the other one vetoes them if they don't come close to the authentic images from a given data set.
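The propose-and-veto loop described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the original GAN implementation: the "images" are single numbers drawn from a bell curve centered on 4, the generator is a hypothetical linear map G(z) = a*z + b, the discriminator is a hypothetical logistic scorer D(x) = sigmoid(w*x + c), and the generator uses the common non-saturating form of its loss. All function and variable names are invented for the example.

```python
import math
import random


def sigmoid(s: float) -> float:
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def discriminator_loss(d_real: float, d_fake: float, eps: float = 1e-12) -> float:
    """Low when D scores authentic data near 1 and generated data near 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake: float, eps: float = 1e-12) -> float:
    """Non-saturating form: low when D is fooled into scoring a fake near 1."""
    return -math.log(d_fake + eps)


def train_toy_gan(steps: int = 2000, lr: float = 0.01, seed: int = 0):
    """Alternate one discriminator update and one generator update per step."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters (assumed form: G(z) = a*z + b)
    w, c = 0.1, 0.0  # discriminator parameters (assumed form: sigmoid(w*x + c))
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)  # an "authentic" sample
        z = rng.gauss(0.0, 1.0)       # random noise fed to the generator
        x_fake = a * z + b            # the generator's proposal

        # Discriminator step: gradient descent on discriminator_loss.
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = -(1.0 - d_real) * x_real + d_fake * x_fake
        grad_c = -(1.0 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: gradient descent on generator_loss,
        # using the freshly updated discriminator as the critic.
        d_fake = sigmoid(w * x_fake + c)
        grad_a = -(1.0 - d_fake) * w * z
        grad_b = -(1.0 - d_fake) * w
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w, c
```

Calling `train_toy_gan()` returns the four learned parameters. Real GANs replace both linear maps with deep networks and rely on automatic differentiation, and even toy versions like this one can oscillate rather than settle, so the sketch is best read as a picture of the alternating updates rather than a recipe.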
Today, researchers are exploring ways to build GANs from other neural network models, including transformers.
Transformer architecture explained
Transformers were introduced in 2017 by a team of Google researchers who were looking to build a more efficient translator. In a paper entitled "Attention Is All You Need," the researchers laid out a new technique to discern the meaning of words based on how they relate to other words in phrases, sentences and essays.
Previous tools for interpreting text frequently used one neural network to translate words into vectors using a previously constructed dictionary and a second network, such as a recurrent neural network (RNN), to process the resulting sequence. In contrast, transformers essentially learn to interpret the meaning of words directly from processing large bodies of unlabeled text. The same approach can also be used to identify patterns in other kinds of data, such as protein sequences, chemical structures, computer code and IoT data streams. Because no hand-labeled data is required, researchers can scale up the large language models driving recent advances -- and publicity -- in the field. Transformers can also find relationships between words that are far apart, which was impractical with RNNs.
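The mechanism that relates distant words is scaled dot-product attention, the core operation of the 2017 paper. The following is a minimal sketch in plain Python, assuming tiny hand-written two-dimensional word vectors. In a real transformer, the queries, keys and values are learned linear projections of the word embeddings; here the raw vectors stand in for all three, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position, near or far.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Illustrative vectors only: the first and last tokens point the same way.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

In this example, the first token gives the last token the same weight it gives itself, and more than it gives its immediate neighbor. Distance in the sequence plays no role in the score, which is why far-apart words can be related in a single step instead of being passed along one position at a time as in an RNN.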
Small snippets of an image can also be defined by the contexts of the entire images in which they appear, Zidaritz said. The idea of self-attention in natural language processing (NLP) becomes self-similarity in computer vision.
GAN vs. transformer: Best use cases for each model
GANs are more flexible in their potential range of applications, according to Richard Searle, vice president of confidential computing at Fortanix, a data security platform. They're also useful where imbalanced data, such as a small number of positive cases compared to the volume of negative instances, can lead to numerous false-positive classifications. As a result, adversarial learning has shown promise in use cases where there's limited training data for discrimination tasks or in fraud detection where only a small number of transactions might represent fraud compared to more common ones. In a fraud scenario, for example, hackers constantly introduce new inputs to fool fraud detection algorithms. GANs tend to be better at adapting to and protecting against these kinds of techniques.
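A short calculation shows why imbalanced data produces so many false positives. The numbers below are hypothetical and chosen only for illustration: 10,000 transactions, 1% of them fraudulent, and a detector that catches 99% of fraud while wrongly flagging 5% of legitimate transactions. The sketch illustrates the imbalance problem itself, not the adversarial remedy.

```python
def alert_precision(n_total, fraud_rate, true_positive_rate, false_positive_rate):
    """Return (true positives, false positives, precision) for a detector.

    All inputs are illustrative assumptions, not measured values.
    """
    n_fraud = n_total * fraud_rate
    n_legit = n_total - n_fraud
    true_positives = n_fraud * true_positive_rate
    false_positives = n_legit * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision


tp, fp, precision = alert_precision(10_000, 0.01, 0.99, 0.05)
print(f"true alerts: {tp:.0f}, false alerts: {fp:.0f}, precision: {precision:.3f}")
```

Under these assumptions, about 99 alerts are genuine and about 495 are false alarms, so only around one alert in six (precision of roughly 0.167) points at real fraud, even though the detector looks accurate on each class taken separately.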
Transformers are typically used where sequential input-output relationships must be derived, Searle said, and the number of possible combinations of features requires focused attention to provide local context. For this reason, transformers have established preeminence in NLP applications, as they can process content of widely varying length, from phrases to whole documents, within the limits of a model's context window. Transformers are also good at suggesting the next move in applications like gaming, where a set of potential responses must be evaluated with respect to the conditional sequence of inputs.
There's also active research into combining GANs and transformers into so-called GANsformers. The idea is to use a transformer to provide an attentional reference so the generator can increase the use of context to enhance content.
"The intuition behind GANsformers is that human attention is based on the specific local features of an object of interest, in addition to the latent global characteristics," Searle explained. The resulting improved representations are more likely to simulate both the global and local features a human may perceive in an authentic sample, such as a realistic face or computer-generated audio consistent with a human voice's tone and rhythm.
Are transformer-based networks stronger than GANs?
Searle expects to see more integration to create text, voice and image data with enhanced realism. "This may be desirable where improved contextual realism or fluency in human-machine interaction or digital content would enhance the user experience," he said. For example, GANsformers might be able to generate synthetic data to pass the Turing test when confronted by both a human user and a trained machine evaluator. In the case of text responses, such as those furnished by a GPT system, the inclusion of idiosyncratic errors or stylistic traits could mask the true origin of an AI-derived output.
Conversely, improved realism may be problematic with deepfakes used to launch cyber attacks, damage brands or spread fake news. In these cases, GANsformers could provide better filters to detect deepfakes.
"The use of adversarial training and contextual evaluation could produce AI systems able to provide enhanced security, improved content filtering and defense against misinformation attacks using generative botnets," Searle said.
But Zidaritz believes transformers can potentially edge out GANs in many use cases since they can be applied to text and images more easily. "New GANs will continue to be developed, but their applications will be more limited than that of GPTs," he said. "It is also likely that we will see more GAN-like transformers and transformer-like GANs, in both of which the transformer with its self-attention or its self-similarity mechanism will be central."