
Generative models: VAEs, GANs, diffusion, transformers, NeRFs

Choosing the right GenAI model for the task requires understanding the techniques each uses and their specific talents. Learn about VAEs, GANs, diffusion, transformers and NeRFs.

Until recently, most AI models focused on becoming better at processing, analyzing and interpreting data. Recent breakthroughs in so-called generative neural network models have brought forth a spate of new tools for creating all kinds of content, from photos and paintings to poems, code, movie scripts and movies.

Overview of top AI generative models

Researchers discovered the promise of new generative AI models in the mid-2010s when variational autoencoders (VAEs), generative adversarial networks (GANs) and diffusion models were developed. Transformers, the groundbreaking neural network that can analyze large data sets at scale to automatically create large language models (LLMs), came on the scene in 2017. In 2020, researchers introduced neural radiance fields (NeRFs), a technique for generating 3D content from 2D images.

These rapidly evolving generative models are a work in progress, as researchers make tweaks that often result in big advances. And the remarkable progress has not slowed down, said Matt White, CEO and founder of Berkeley Synthetic.

"Model architectures are constantly changing, and new model architectures will continue to be developed," said White, who also teaches at University of California, Berkeley.

Each model has its special talent. At present, diffusion models perform exceptionally well in the image and video synthesis domain, and transformers perform well in the text domain. GANs are good at augmenting small data sets with plausible synthetic samples. But choosing the best models is always up to the specific use case.

"All of the models are not equal. AI researchers and ML [machine learning] engineers have to select the appropriate one for the appropriate use case and required performance, as well as consider limitations the models may have in compute, memory and capital," White said.

Transformers, in particular, have driven much of the recent progress in and excitement about generative models.

"The most recent breakthroughs in AI models have come from pre-training models on large amounts of data and using self-supervised learning to train models without explicit labels," said Adnan Masood, chief AI architect at UST, a digital transformation consultancy.

For example, OpenAI's Generative Pre-trained Transformer series includes some of the largest and most powerful models in this category, with one of the latest, GPT-3, containing 175 billion parameters.


Key applications of the top generative AI models

The top generative AI models use different techniques and approaches to generate new data, Masood explained. Key features and uses include the following:

  • VAEs use an encoder-decoder architecture to generate new data, typically for image and video generation, such as generating synthetic faces for privacy protection.
  • GANs use a generator and discriminator to generate new data and are often used in video game development for creating realistic game characters.
  • Diffusion models add and then remove noise to generate quality images with high levels of detail, creating near-realistic images of natural scenes.
  • Transformers effectively process sequential data in parallel for machine translation, text summarization and image creation.
  • NeRFs provide a novel approach to 3D scene reconstruction that uses a neural representation.

Let's dive into each approach in more detail.


VAEs

VAEs were developed in 2014 to encode data more efficiently using a neural network.

Yael Lev, head of AI for Sisense, an AI analytics platform, said VAEs learn to represent information more efficiently. They have two parts: an encoder that compresses the data into a smaller representation and a decoder that reconstructs it back to its original form. VAEs are well suited to generating new examples from that compressed representation, cleaning up noisy images or data, detecting anomalies and filling in missing information.
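The encoder-decoder idea Lev describes can be sketched in a few lines of numpy. This is a toy illustration, not a trained model: the weights are random, and the dimensions (a 64-value input compressed to a 2-D latent space) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 64-value "image" compressed to a 2-D latent space.
input_dim, latent_dim = 64, 2

# Random, untrained weights stand in for a learned encoder and decoder.
W_enc_mu = rng.normal(size=(input_dim, latent_dim))
W_enc_logvar = rng.normal(size=(input_dim, latent_dim))
W_dec = rng.normal(size=(latent_dim, input_dim))

def encode(x):
    # The encoder maps the input to the parameters of a latent Gaussian.
    return x @ W_enc_mu, x @ W_enc_logvar

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps; this trick keeps sampling differentiable
    # during training.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    # The decoder maps the latent sample back to input space.
    return z @ W_dec

x = rng.normal(size=(1, input_dim))
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_recon = decode(z)
print(z.shape, x_recon.shape)  # the latent code is far smaller than the input
```

In a real VAE, both mappings would be deep networks trained with a reconstruction loss plus a KL-divergence term that keeps the latent space well behaved; sampling new latent vectors and decoding them is what generates new examples.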

However, VAEs also tend to produce blurry or low-quality images, UST's Masood said. Another issue is that the latent space, a low-dimensional space for capturing the structure of data, is intricate and challenging to work with. These weaknesses can limit the effectiveness of VAEs in applications where high-quality images or a clear understanding of the latent space is critical. The next iteration of VAEs will likely focus on improving the quality of generated data, increasing training speed and exploring their applicability to sequential data.


GANs

GANs were developed in 2014 to generate realistic faces and printed numbers. GANs pit a generating neural network that creates realistic content against a discriminating neural network that detects fake content. "Iteratively, the two networks converge to produce a generated image that is indistinguishable from the original data," said Anand Rao, global AI lead at PwC.
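The adversarial setup Rao describes can be sketched with a 1-D toy: the generator and discriminator here are hypothetical single-parameter models with untrained weights, chosen only to show how the two opposing losses are computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: real data is drawn from N(4, 1); the generator is a
# single affine transform of noise, the discriminator a logistic regression.
g_w, g_b = 1.0, 0.0          # untrained generator parameters
d_w, d_b = 1.0, 0.0          # untrained discriminator parameters

def generator(z):
    return g_w * z + g_b

def discriminator(x):
    # Sigmoid score: the estimated probability that a sample is real.
    return 1.0 / (1.0 + np.exp(-(d_w * x + d_b)))

real = rng.normal(loc=4.0, scale=1.0, size=128)
fake = generator(rng.normal(size=128))

# Discriminator loss: push real scores toward 1 and fake scores toward 0.
d_loss = -np.mean(np.log(discriminator(real)) + np.log(1 - discriminator(fake)))
# Generator loss: push the discriminator's scores on fakes toward 1.
g_loss = -np.mean(np.log(discriminator(fake)))

print(round(d_loss, 3), round(g_loss, 3))
```

Training alternates gradient updates on these two losses: the discriminator gets better at telling real from fake, which forces the generator to produce more convincing samples, until the two converge.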

GANs are commonly used for image generation, image editing, super-resolution, data augmentation, style transfer, music generation, and deepfake creation.

One issue with GANs is that they can suffer from mode collapse, in which the generator produces limited and repetitive outputs, making the networks difficult to train. Masood said the next generation of GANs will focus on improving the stability and convergence of the training process, expanding their applicability to other domains, and developing more efficient evaluation metrics.

Lev observed that GANs are also hard to optimize and stabilize, and there is no explicit control over the generated samples.


Diffusion models

Diffusion models were developed by a team of Stanford researchers in 2015 to model and reverse the gradual addition of noise to data. The terms Stable Diffusion and diffusion are sometimes used interchangeably, since the Stable Diffusion application -- released in 2022 -- helped bring attention to the older technique of diffusion. Diffusion techniques provide a way to model a physical process, such as how a substance like salt diffuses into a liquid, and then reverse it. The same approach is also useful for generating new content, starting from an image of pure noise.
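The forward half of that process -- progressively noising data -- can be written in closed form. The sketch below uses a hypothetical linear noise schedule; the numbers are illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cum = np.cumprod(1.0 - betas)  # cumulative signal retention per step

def noise_to_step(x0, t):
    # Closed-form forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps,
    # where a_t is the cumulative product of (1 - beta) up to step t.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_cum[t]) * x0 + np.sqrt(1.0 - alphas_cum[t]) * eps

x0 = np.ones(8)                    # stand-in for a clean image
x_mid = noise_to_step(x0, 100)     # partially noised
x_end = noise_to_step(x0, T - 1)   # almost pure Gaussian noise

# By the final step nearly all of the original signal is gone; a trained
# network learns to reverse these steps, turning noise back into an image.
print(alphas_cum[0] > alphas_cum[-1], np.sqrt(alphas_cum[-1]) < 0.01)
```

Generation then runs the learned reverse process: start from pure noise and repeatedly predict and subtract the noise, step by step, until a clean image emerges.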

White said that diffusion models are the current go-to for image generation. They are the base model for popular image generation services, such as DALL-E 2, Stable Diffusion, Midjourney and Imagen. They are also used in pipelines for generating voices, video and 3D content. The diffusion technique can also be used for data imputation, where missing data is predicted and generated.

Many applications pair diffusion models with an LLM for text-to-image or text-to-video generation. For example, Stable Diffusion 2 uses a Contrastive Language-Image Pre-training (CLIP) model as the text encoder. It also adds models for depth and upscaling.

Masood predicted that further improvements to models like Stable Diffusion may focus on improvement to negative prompting, enhancing the ability to generate images in the style of specific artists and improving celebrity images.


Transformers

Transformers were developed in 2017 by a team at Google Brain to improve language translation. They are well suited to modeling relationships between sequence elements regardless of their position, processing data in parallel and scaling up to large models trained on unlabeled data.
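The mechanism behind that parallel processing is scaled dot-product attention, which can be sketched directly in numpy. The sequence length and embedding size below are hypothetical, and the query/key/value matrices are random stand-ins for learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    # Scaled dot-product attention: every position attends to every other
    # position at once, which is what lets transformers process a whole
    # sequence in parallel rather than token by token.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V, weights

seq_len, d_model = 5, 16  # hypothetical tiny sequence and embedding size
Q = rng.normal(size=(seq_len, d_model))  # stand-ins for learned projections
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

out, w = attention(Q, K, V)
print(out.shape)                       # one mixed vector per position
print(np.allclose(w.sum(axis=-1), 1))  # attention weights sum to 1 per row
```

A full transformer stacks many such attention layers (with multiple heads, positional encodings and feed-forward layers), but this single operation is the core that replaced the step-by-step recurrence of earlier sequence models.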

White said they can be used for text summarization, chatbots, recommendation engines, language translation, knowledge bases, hyperpersonalization (through preference models), sentiment analysis, and named entity recognition for identifying people, places and things. They can also be used for speech recognition like OpenAI's Whisper, object detection in videos and images, image captioning, text classification activities and dialogue generation.

Their versatility notwithstanding, transformers do have limitations. They can be expensive to train and require large data sets.

The resulting models are also quite large, which makes it challenging to identify the source of bias or inaccurate results. Masood said, "Their complexity can also make it difficult to interpret their inner workings, hindering their explainability and transparency."


NeRFs

NeRFs were developed in 2020 to capture 3D representations of light fields in a neural network. The first implementation was extremely slow, taking several days to produce its first 3D imagery.

However, in 2022, Nvidia researchers found a way to generate a new model in about 30 seconds. These models can represent 3D objects in a few megabytes, at comparable quality, where other techniques can require gigabytes. There is hope they could lead to more efficient techniques for capturing and generating 3D objects in the metaverse. Nvidia Director of Research Alexander Keller told Time that NeRFs "could ultimately be as important to 3D graphics as digital cameras have been to modern photography."
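At its core, a NeRF is a network that maps a 3D point (and viewing direction) to a color and density, and images are formed by compositing those values along camera rays. The sketch below substitutes a made-up analytic function for the trained network, so the colors are meaningless; it only shows the volume-rendering step.

```python
import numpy as np

# Hypothetical stand-in for a trained NeRF network: maps a 3-D point (a real
# NeRF also takes a viewing direction) to an RGB color and a volume density.
def radiance_field(points):
    rgb = 0.5 + 0.5 * np.sin(points)                  # fake color, illustration only
    sigma = np.exp(-np.linalg.norm(points, axis=-1))  # fake density
    return rgb, sigma

# Sample points along a single camera ray and alpha-composite them --
# the volume-rendering step NeRF uses to form each output pixel.
origin = np.zeros(3)
direction = np.array([0.0, 0.0, 1.0])
t = np.linspace(0.1, 4.0, 64)              # depths along the ray
points = origin + t[:, None] * direction

rgb, sigma = radiance_field(points)
delta = np.diff(t, append=t[-1] + (t[1] - t[0]))  # spacing between samples
alpha = 1.0 - np.exp(-sigma * delta)              # opacity of each segment
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # light surviving so far
weights = trans * alpha
pixel = (weights[:, None] * rgb).sum(axis=0)      # final RGB for this pixel

print(pixel.shape)
```

Training a NeRF means fitting the network so that pixels rendered this way, from many rays across many camera poses, match the input 2D photographs.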

Masood said NeRFs have also shown great potential for robotics, urban mapping, autonomous navigation and virtual reality applications.

However, NeRFs are still computationally expensive. It's also challenging to compose multiple NeRFs into larger scenes. White cautioned that the only viable use case for NeRFs today is converting images into 3D objects or scenes.

Despite these limitations, Masood predicted NeRFs will find new roles in fundamental image processing tasks, such as denoising, deblurring, upsampling, compression and image editing.

GenAI ecosystem, a work in progress

It's important to note that these models are works in progress. Researchers are looking to improve individual models and ways of combining them with other models and processing techniques.

Lev predicted that generative models will become more versatile, with applications expanding beyond their traditional domains. Users will also be able to guide the models more efficiently and better understand how they work.

In addition, each technique will get better at supporting additional types of data, Rao said.

"Currently, many of the techniques are optimized for the specific modality of data, such as text or images," he said. "We will see more multimodal generation techniques that use the same underlying technique for all different modalities of data."

White pointed out there is also work being done on multimodal models that use retrieval methods to call upon a library of models optimized for specific tasks. He also expects generative models to develop other functionality, like making API calls and using external tools. For example, an LLM fine-tuned on a company's call center knowledge will provide answers to questions and perform troubleshooting, like resetting a customer modem or sending an email when the issue is resolved.

Indeed, the popular model architectures of today may eventually be replaced by something more efficient in the future. "Perhaps transformers and diffusion models will outlive their usefulness when new architectures arise," White said. "We saw this with transformers when their introduction made long short-term memory algorithms and RNNs [recurrent neural networks] less favorable methods for natural language applications."

Rao also predicted that the generative AI ecosystem will evolve into three layers of models. The base layer is a series of text-, image-, voice- and code-based foundational models. These models ingest large volumes of data, are built on large deep learning models and incorporate human judgment.

Next up, industry- and function-specific domain models will improve the processing of healthcare, legal or other types of data.

At the top level, companies will use proprietary data and their subject matter expertise to build proprietary models. These three layers will disrupt how teams develop models and will usher in a new era of model as a service.

How to pick a generative AI model: Top considerations

Top considerations when choosing among models include the following, according to Sisense's Lev:

  • The problem you want to solve. Choose a model that is known to work well for your specific task. For example, use transformers for language tasks and NeRFs for 3D scenes.
  • The amount and quality of your data. Transformers need lots of good data to work well, while VAEs work better with less data.
  • The quality of the results. GANs are better for clear and detailed images, while VAEs are better for smoother results.
  • How easy it is to train the model. GANs can be difficult to train, while VAEs and transformers are easier.
  • Computational resources. NeRFs and large transformers both need a lot of computer power to work well.
  • Need for control and understanding. VAEs might be better than GANs if you want more control over the results or a better understanding of how the model works.
