
synthetic data

What is synthetic data?

Synthetic data is information that's artificially manufactured rather than generated by real-world events. It's created algorithmically and is used as a stand-in for production or operational data in test data sets, to validate mathematical models and to train machine learning (ML) models.

While gathering high-quality data from the real world is difficult, expensive and time-consuming, synthetic data technology enables users to quickly, easily and digitally generate the data in whatever amount they desire, customized to their specific needs.

Why is synthetic data important?

The use of synthetic data is gaining wide acceptance because it can provide several benefits over real-world data. Gartner predicted that, by 2024, 60% of the data used for developing AI and analytics will be artificially produced.

The largest application of synthetic data is in the training of neural networks and ML models, as the developers of these models need carefully labeled data sets that could range from a few thousand to tens of millions of items. Synthetic data can be artificially generated to mimic real data sets, enabling companies to create a diverse and large amount of training data without spending a lot of money and time. According to Paul Walborsky, co-founder of AI.Reverie, one of the first dedicated synthetic data services, a single image that would cost $6 from a labeling service can be artificially generated for 6 cents.

Synthetic data can also be used to protect user privacy and comply with privacy laws, particularly when dealing with sensitive health and personal data. Additionally, it can help lessen bias in data sets by giving users access to diverse data that more accurately depicts the real world.

How is synthetic data generated?

The process of generating synthetic data differs by the tools and algorithms used and the specific use case.

Figure: How a generative adversarial network (GAN) is trained. The GAN training process is a popular approach used for producing AI-generated content.

The following are three common techniques used for creating synthetic data:

  1. Drawing numbers from a distribution. Randomly selecting numbers from a distribution is a common method for creating synthetic data. Although this method doesn't capture the insights of real-world data, it can produce a data distribution that closely resembles real-world data.
  2. Agent-based modeling. This simulation technique involves creating unique agents that communicate with one another. These methods are especially helpful when examining how different agents -- such as mobile phones, people or even computer programs -- interact with one another in a complex system. Python packages such as Mesa provide pre-built core components that make it easier to quickly develop agent-based models and view them via a browser-based interface.
  3. Generative models. These algorithms can generate synthetic data that replicates the statistical properties or features of real-world data. Generative models use a set of training data to learn the statistical patterns and relationships in the data and then use this knowledge to generate new synthetic data that's similar to the original data. Examples of generative models include generative adversarial networks and variational autoencoders.
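
The first technique in the list above, drawing numbers from a distribution, can be sketched in a few lines of Python. This is a minimal illustration rather than a production method: it assumes the real values are roughly normally distributed, and the sample values and the helper names `fit_gaussian` and `sample_synthetic` are hypothetical.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of the observed values."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, stdev, n, seed=None):
    """Draw n synthetic values from a normal distribution with the given parameters."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements, e.g. order values in dollars (illustrative only).
real_order_values = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.6]

mean, stdev = fit_gaussian(real_order_values)
synthetic_order_values = sample_synthetic(mean, stdev, n=10_000, seed=0)

print(f"real:      mean={mean:.2f}, stdev={stdev:.2f}")
print(
    f"synthetic: mean={statistics.mean(synthetic_order_values):.2f}, "
    f"stdev={statistics.stdev(synthetic_order_values):.2f}"
)
```

The synthetic sample reproduces the overall shape of the data (its mean and spread) but nothing else, which is the limitation noted for this technique: relationships between columns, seasonality and outliers in the real data aren't captured.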

What are the advantages of synthetic data?

Synthetic data offers the following advantages:

  • Customizable data. An organization can customize synthetic data to its needs, tailoring the data to certain conditions that can't be obtained with authentic data. It can also generate data sets for software testing and quality assurance (QA) purposes for DevOps teams.
  • Cost-effective. Synthetic data is an inexpensive alternative to real-world data. For example, real vehicle crash data can cost an automaker more to collect than simulated data.
  • Data labeling. Even when real-world data is available, it isn't always labeled. For supervised learning tasks, manually labeling a multitude of instances can be time-consuming and error-prone. Synthetically labeled data can be created to speed up the model development process. Because the labels come from the generation process itself, they're also accurate by construction.
  • Faster production. Because synthetic data isn't gathered from actual events, it's possible to create a data set more quickly with the right software and technology. As a result, a significant amount of artificial data can be created in a shorter amount of time.
  • Complete annotation. Automatic annotation eliminates the need for manual labeling. A variety of annotations can be generated automatically for each object in a scene. This is also one of the main reasons synthetic data is so inexpensive when compared to real data.
  • Data privacy. While synthetic data can resemble real data, it shouldn't contain any information that could be used to identify the real individuals or records behind it. This characteristic makes the synthetic data anonymous and suitable for dissemination and can be a major plus point for the healthcare and pharmaceutical industries.
  • Full user control. A synthetic data simulation enables complete control over every aspect. The person handling the data set can control event frequency, item distribution and many other factors. ML practitioners also have total control over the data set when using synthetic data. Some examples include controlling the degree of class separations, sampling size and level of noise in the data set.
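
The kind of control described in the list above, over class separation, sample size and noise, can be sketched as follows. This is a minimal example under stated assumptions: the generator `make_two_class_data` and the trivial `threshold_accuracy` check are hypothetical helpers written for illustration, and the two-class Gaussian layout is just one simple way to expose these knobs.

```python
import random


def make_two_class_data(n_per_class, separation, noise, seed=None):
    """Return shuffled (x, y, label) rows for two classes.

    separation: distance between the two class centers along the x axis.
    noise: standard deviation of the Gaussian scatter around each center.
    """
    rng = random.Random(seed)
    centers = {0: (-separation / 2, 0.0), 1: (separation / 2, 0.0)}
    rows = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            # The label is known at generation time, so no manual labeling is needed.
            rows.append((rng.gauss(cx, noise), rng.gauss(cy, noise), label))
    rng.shuffle(rows)
    return rows


def threshold_accuracy(rows):
    """Accuracy of the trivial rule 'predict class 1 when x > 0'."""
    correct = sum((x > 0) == bool(label) for x, _, label in rows)
    return correct / len(rows)


# Same size and noise, different class separation.
easy = make_two_class_data(n_per_class=500, separation=6.0, noise=1.0, seed=0)
hard = make_two_class_data(n_per_class=500, separation=1.0, noise=1.0, seed=0)

print(f"well-separated classes: accuracy={threshold_accuracy(easy):.2f}")
print(f"overlapping classes:    accuracy={threshold_accuracy(hard):.2f}")
```

Changing a single parameter produces an easier or harder data set on demand, which is difficult to arrange with data collected from real events.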

Synthetic data also comes with some drawbacks, including inconsistencies when trying to replicate the complexity found within the original data set and the inability to replace authentic data outright, as accurate, authentic data is still required to produce useful synthetic examples of the information.

What are the use cases for synthetic data?

Synthetic data should appropriately reflect the original data that it strives to improve. Typical use cases for synthetic data include the following:

  • Testing. Compared to rules-based test data, synthetic test data is easier to create and offers flexibility, scalability and realism. For data-driven testing and software development, synthetic data is crucial.
  • AI/ML model training. Synthetic data is increasingly being used to train AI models, as it can in some cases match or outperform real-world data and helps fill gaps where real data is scarce. Synthetic training data can improve model performance, reduce bias and add fresh domain knowledge and explainability. Besides being privacy-compliant, it can also enhance the original data thanks to the nature of the AI-powered synthetization process. For example, in artificial training data, uncommon patterns and occurrences can be upsampled.
  • Privacy regulations. Synthetic data enables data scientists to abide by data privacy laws, such as the Health Insurance Portability and Accountability Act, General Data Protection Regulation and California Consumer Privacy Act. It's also the best option when using sensitive data sets for testing or training. Synthetic data enables organizations to gain insights without jeopardizing privacy compliance.
  • Health and privacy. Health and privacy data are particularly appropriate for a synthetic approach because privacy rules place significant restrictions on these fields. By using synthetic data, researchers can extract the information they require without invading people's privacy. Because synthetic data doesn't represent the data of actual patients, it's extremely unlikely that it results in the reidentification of an actual patient or their personal data record. Synthetic data also has a big advantage over data masking techniques, which pose greater privacy-related risks.
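
The upsampling of uncommon patterns mentioned under AI/ML model training can be illustrated with a deliberately simple stand-in. The sketch below is naive jittered oversampling, not the AI-powered synthetization process the list refers to: it copies rare-class rows and adds small Gaussian noise. The function name, the `jitter` parameter and the toy data set are all assumptions made for illustration.

```python
import random
from collections import Counter


def upsample_rare_class(rows, rare_label, target_count, jitter=0.05, seed=None):
    """Add jittered copies of rare-class rows until the class has target_count rows.

    rows: list of (features, label) pairs, where features is a list of floats.
    jitter: standard deviation of the Gaussian noise added to each copied feature.
    """
    rng = random.Random(seed)
    rare_rows = [features for features, label in rows if label == rare_label]
    if not rare_rows:
        raise ValueError("no rows with the rare label to sample from")
    augmented = list(rows)  # leave the caller's list untouched
    for _ in range(max(0, target_count - len(rare_rows))):
        base = rng.choice(rare_rows)
        synthetic = [value + rng.gauss(0.0, jitter) for value in base]
        augmented.append((synthetic, rare_label))
    return augmented


# Hypothetical imbalanced data set: 95 "normal" rows and 5 "rare" rows.
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "normal") for _ in range(95)]
data += [([rng.gauss(5, 1), rng.gauss(5, 1)], "rare") for _ in range(5)]

balanced = upsample_rare_class(data, "rare", target_count=95, seed=1)

print("before:", Counter(label for _, label in data))
print("after: ", Counter(label for _, label in balanced))
```

A generative model would learn the structure of the rare class rather than perturbing existing rows, but the goal is the same: give the model enough examples of the uncommon case to learn from.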

What are examples of synthetic data?

Synthetic data is used across many different industries for various use cases. The following are some examples of synthetic data applications:

  • Media data. In this use case, computer graphics and image processing algorithms are used to generate synthetic images, audio and video. For example, Amazon uses synthetic data to train Amazon Alexa's language system.
  • Text data. This can include chatbots, machine translation algorithms and sentiment analysis based on artificially generated text data. ChatGPT is an example of a tool that uses text data.
  • Tabular data. This consists of synthetically generated data tables used for data analysis, model training and other applications.
  • Unstructured data. Unstructured data can include images, video and audio data that are mostly employed in fields such as computer vision, speech recognition and autonomous vehicle technology. For example, Alphabet's Waymo uses synthetic data to train its self-driving cars.
  • Financial services data. The financial sector relies heavily on synthetic data, especially for fraud detection, risk management and credit risk assessments. For example, JPMorgan and American Express use synthetic financial data to improve fraud detection.
  • Manufacturing data. The manufacturing industry uses synthetic data for quality control testing and predictive maintenance. For instance, German insurance company Provinzial tests synthetic data for predictive analytics.

Synthetic data vs. real data

Financial services and healthcare are two industries that benefit from synthetic data techniques. The techniques can be used to manufacture data with attributes similar to actual sensitive or regulated data. This enables data professionals to use and share data more freely.

For example, synthetic data lets healthcare data professionals allow public use of record-level data while still maintaining patient confidentiality.

In the financial sector, synthetic data sets, such as debit and credit card payments, that look and act as typical transaction data can help expose fraudulent activity. Data scientists can use synthetic data to test or evaluate fraud detection systems, as well as develop new fraud detection methods. Synthetic financial data sets can be found on Kaggle, a crowdsourced platform that hosts predictive modeling and analytics competitions.
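
As a rough sketch of how synthetic transactions can be used to evaluate a fraud detection system, the example below generates labeled card payments and scores a toy detector against them. Everything here is hypothetical: the record fields, the assumption that fraudulent payments tend to be larger, the distribution parameters and the threshold rule are chosen only for illustration.

```python
import random


def make_transactions(n, fraud_rate, seed=None):
    """Generate synthetic card transactions with a known 'is_fraud' flag."""
    rng = random.Random(seed)
    transactions = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        # Assumption for illustration: fraudulent payments tend to be larger.
        if is_fraud:
            amount = rng.lognormvariate(6.5, 0.5)
        else:
            amount = rng.lognormvariate(3.5, 0.8)
        transactions.append({"id": i, "amount": round(amount, 2), "is_fraud": is_fraud})
    return transactions


def flag_large_amounts(transaction, threshold=300.0):
    """Toy detector under test: flag any payment above the threshold."""
    return transaction["amount"] > threshold


def evaluate(detector, transactions):
    """Return (precision, recall) of the detector against the known labels."""
    tp = sum(1 for t in transactions if detector(t) and t["is_fraud"])
    fp = sum(1 for t in transactions if detector(t) and not t["is_fraud"])
    fn = sum(1 for t in transactions if not detector(t) and t["is_fraud"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


transactions = make_transactions(n=20_000, fraud_rate=0.02, seed=0)
precision, recall = evaluate(flag_large_amounts, transactions)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Because every synthetic transaction carries a ground-truth label, the detector can be scored without touching real customer payments. The results only say how the detector behaves on the simulated patterns, so they depend on how realistic the generator is.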

DevOps teams use synthetic data for software testing and QA. They can plug artificially generated data into a process without taking authentic data out of production. However, some experts recommend DevOps teams choose data masking techniques over synthetic data techniques because production data sets contain complex relationships that make it hard to manufacture an accurate representation quickly and cheaply.
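
A minimal sketch of artificially generated test records for software testing and QA might look like the following. The customer schema and the helper names are hypothetical; a real test suite would mirror the fields and constraints of the system under test.

```python
import random
import string


def make_customer(rng, customer_id):
    """Build one fake customer record; no field is derived from real data."""
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "customer_id": customer_id,
        "email": f"{name}@example.com",
        "age": rng.randint(18, 90),
        "balance_cents": rng.randint(0, 1_000_000),
    }


def make_customers(n, seed=None):
    """Generate n fake customers; a fixed seed makes test runs reproducible."""
    rng = random.Random(seed)
    return [make_customer(rng, i) for i in range(1, n + 1)]


test_customers = make_customers(1_000, seed=42)
print(test_customers[0])
```

Independent random fields like these are quick to produce, but they don't reproduce the cross-table relationships found in production data, which is the concern behind the recommendation to consider data masking instead.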

Synthetic data and machine learning

Synthetic data is gaining traction within the machine learning domain. ML algorithms are trained using an immense amount of data, and collecting the necessary amount of labeled training data can be cost-prohibitive.

Synthetically generated data can help companies and researchers build the data repositories needed to train ML models, or to pre-train them before fine-tuning on real data, a technique referred to as transfer learning.

Research efforts to advance synthetic data use in ML are underway. For example, members of the Data to AI Lab at the Massachusetts Institute of Technology Laboratory for Information and Decision Systems documented their recent successes with the Synthetic Data Vault, which can construct ML models that automatically generate synthetic data.

Companies are also beginning to experiment with synthetic data techniques. For example, a team at Deloitte LLC used synthetic data to build an accurate model by artificially manufacturing 80% of the training data, using real data as seed data. Computer vision, image recognition and robotics are additional applications that are benefiting from the use of synthetic data.

What is the history of synthetic data?

Synthetic data dates back to the 1970s, in the early decades of computing. Most initial systems and algorithms depended on data to function. However, restricted processing capacity, challenges in collecting vast volumes of data and privacy concerns led to the creation of synthetic data.

In the ImageNet competition of 2012 -- commonly referred to as the Big Bang of AI -- a group of researchers led by Geoff Hinton succeeded in training an artificial neural network that won the image classification challenge by a startlingly large margin. In its wake, once it was clear how effectively neural networks could recognize objects, researchers began looking seriously at artificial data.

Machine learning can use synthetic data to remove bias, democratize data, enhance privacy and reduce costs.

This was last updated in March 2023
