synthetic data

Synthetic data is information that is artificially manufactured rather than generated by real-world events. It is created algorithmically and is used as a stand-in for production or operational data in test datasets, to validate mathematical models and, increasingly, to train machine learning models.

The benefits of using synthetic data include reducing the constraints that come with sensitive or regulated data, tailoring data to conditions that cannot be obtained with authentic data, and generating datasets that DevOps teams can use for software testing and quality assurance.

Drawbacks include inconsistencies that arise when trying to replicate the complexity of the original dataset. Synthetic data also cannot replace authentic data outright, because accurate authentic data is still required to produce useful synthetic examples.

Synthetic data vs. real data

Financial services and healthcare are two industries that benefit from synthetic data techniques. The techniques can be used to manufacture data with similar attributes to actual sensitive or regulated data. This enables data professionals to use and share data more freely.

For example, synthetic data enables healthcare data professionals to release record-level data for public use while still maintaining patient confidentiality.
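As a rough illustration of how data with similar attributes might be manufactured, the sketch below summarizes each column of a small table and then samples new rows from those summaries. The record fields, the values and the function names (fit_column_models, sample_synthetic) are assumptions made up for this example. Sampling each column independently is a deliberate simplification: it discards relationships between columns, and on its own it is not a privacy guarantee.

```python
import random
import statistics
from collections import Counter

def fit_column_models(records):
    """Summarize each column: mean/stdev for numbers, value counts otherwise."""
    models = {}
    for column in records[0]:
        values = [row[column] for row in records]
        if all(isinstance(v, (int, float)) for v in values):
            models[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            counts = Counter(values)
            models[column] = ("categorical", list(counts), list(counts.values()))
    return models

def sample_synthetic(models, n, seed=0):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, model in models.items():
            if model[0] == "numeric":
                _, mean, stdev = model
                row[column] = round(rng.gauss(mean, stdev), 1)
            else:
                _, categories, weights = model
                row[column] = rng.choices(categories, weights=weights, k=1)[0]
        rows.append(row)
    return rows

# Hypothetical stand-in for real patient records (values invented for illustration).
real_records = [
    {"age": 34, "systolic_bp": 118, "diagnosis": "healthy"},
    {"age": 51, "systolic_bp": 135, "diagnosis": "hypertension"},
    {"age": 47, "systolic_bp": 128, "diagnosis": "healthy"},
    {"age": 62, "systolic_bp": 142, "diagnosis": "hypertension"},
    {"age": 29, "systolic_bp": 115, "diagnosis": "healthy"},
    {"age": 58, "systolic_bp": 139, "diagnosis": "diabetes"},
]

synthetic_records = sample_synthetic(fit_column_models(real_records), n=1000)
```

The synthetic rows follow the per-column averages and category frequencies of the input, but no synthetic row is copied from an input row. Preserving cross-column relationships would need a joint model rather than independent columns.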

In the financial sector, synthetic datasets such as debit and credit card payments that look and act like typical transaction data can help expose fraudulent activity. Data scientists can use synthetic data to test or evaluate fraud detection systems as well as develop new fraud detection methods. Synthetic financial datasets can be found on Kaggle, a crowdsourced platform that hosts predictive modeling and analytics competitions.
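To make the fraud-testing idea concrete, here is a minimal sketch that generates labeled synthetic card transactions and scores a toy detector against them. Everything in it is an assumption for illustration: the amount and hour distributions, the 2% fraud rate, the rule inside flag_suspicious and the function names. It is not drawn from any real dataset or production system.

```python
import random

def make_transactions(n, fraud_rate=0.02, seed=0):
    """Generate labeled synthetic transactions (distributions are assumed)."""
    rng = random.Random(seed)
    transactions = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            amount = rng.lognormvariate(6.5, 0.5)  # assumed: fraud skews large
            hour = rng.randint(0, 4)               # assumed: fraud happens at night
        else:
            amount = rng.lognormvariate(3.5, 0.8)  # assumed: mostly small purchases
            hour = rng.randint(0, 23)
        transactions.append(
            {"amount": round(amount, 2), "hour": hour, "is_fraud": is_fraud}
        )
    return transactions

def flag_suspicious(tx, amount_limit=300.0):
    """Toy detector under test: flag large purchases made before 6 a.m."""
    return tx["amount"] > amount_limit and tx["hour"] < 6

def evaluate(transactions, detector):
    """Return (precision, recall) of the detector on labeled transactions."""
    true_pos = false_pos = false_neg = 0
    for tx in transactions:
        flagged = detector(tx)
        if flagged and tx["is_fraud"]:
            true_pos += 1
        elif flagged:
            false_pos += 1
        elif tx["is_fraud"]:
            false_neg += 1
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

transactions = make_transactions(10_000)
precision, recall = evaluate(transactions, flag_suspicious)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because every synthetic transaction carries a known label, the detector can be scored without touching customer data. The scores say how the detector handles the assumed fraud pattern, not how it would perform on real fraud.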

DevOps teams use synthetic data for software testing and quality assurance. Artificially generated data can be plugged into a process without taking authentic data out of production. However, some experts recommend DevOps teams choose data masking techniques over synthetic data techniques because production datasets contain complex relationships that make it hard to manufacture an accurate representation quickly and cheaply.
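For contrast with synthetic generation, the following sketch shows one simple form of data masking: sensitive fields are replaced with deterministic tokens while every other value, and the relationships between rows, stay as they were. The order records, the salt and the function names are hypothetical. A salted hash is only one masking approach, and low-entropy fields masked this way can still be guessed, so treat it as a sketch rather than a recommended control.

```python
import hashlib

def mask_value(value, salt):
    """Replace a sensitive value with a short, deterministic token."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:12]

def mask_records(records, sensitive_fields, salt):
    """Copy each record, masking only the fields named in sensitive_fields."""
    return [
        {
            key: mask_value(value, salt) if key in sensitive_fields else value
            for key, value in record.items()
        }
        for record in records
    ]

# Hypothetical production rows (invented for illustration).
orders = [
    {"customer_id": "C-1001", "email": "ana@example.com", "total": 59.90},
    {"customer_id": "C-1002", "email": "raj@example.com", "total": 12.50},
    {"customer_id": "C-1001", "email": "ana@example.com", "total": 7.25},
]

masked_orders = mask_records(orders, {"customer_id", "email"}, salt="test-env-salt")
```

The first and third rows still share the same masked customer_id, so joins and per-customer aggregates behave as they do in production. That preserved structure is what a synthetic generator would have to model explicitly.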

Synthetic data and machine learning

Synthetic data is gaining traction within the machine learning domain. Machine learning algorithms are often trained on very large amounts of data, and collecting the necessary amount of labeled training data can be cost-prohibitive.

Synthetically generated data can help companies and researchers build the data repositories needed to train and even pre-train machine learning models. Pre-training a model on one dataset and then adapting it to another is a technique referred to as transfer learning.
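The pre-train-then-adapt pattern can be sketched with a deliberately tiny model. Below, a one-variable linear model stands in for a real machine learning model: it is first fitted on plentiful synthetic points from an imperfect simulator, then fine-tuned on a handful of "real" points. The slopes, intercepts, noise level, learning rate and function names are all assumptions chosen for illustration.

```python
import random

def make_points(n, slope, intercept, rng):
    """Sample n noisy points from the line y = slope * x + intercept."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 5)
        points.append((x, slope * x + intercept + rng.gauss(0, 0.1)))
    return points

def sgd_fit(points, w, b, epochs, lr=0.01):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in points:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b

def mse(points, w, b):
    """Mean squared error of the line (w, b) on the given points."""
    return sum(((w * x + b) - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
# Assumption: the simulator is cheap but slightly wrong (slope 2.0, intercept 1.0).
synthetic_points = make_points(500, slope=2.0, intercept=1.0, rng=rng)
# Assumption: real data is scarce and follows a slightly different line.
real_points = make_points(10, slope=2.3, intercept=0.8, rng=rng)

# Pre-train from scratch on synthetic data only.
w_pre, b_pre = sgd_fit(synthetic_points, w=0.0, b=0.0, epochs=20)
# Fine-tune the pre-trained parameters on the few real points.
w_ft, b_ft = sgd_fit(real_points, w=w_pre, b=b_pre, epochs=50)

error_before = mse(real_points, w_pre, b_pre)
error_after = mse(real_points, w_ft, b_ft)
```

Fine-tuning starts from parameters that are already close, so a few real points are enough to reduce the error on real data. Because the error is measured on the same ten points used for fine-tuning, a realistic evaluation would hold some real data out.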

Research efforts to advance synthetic data use in machine learning are underway. For example, members of the Data to AI Lab at the MIT Laboratory for Information and Decision Systems documented the recent successes they had with the Synthetic Data Vault, a system that constructs machine learning models to automatically generate and extract its own synthetic data.

Companies are also beginning to experiment with synthetic data techniques. For example, a team at Deloitte LLC used synthetic data to build an accurate model by artificially manufacturing 80% of the training data, using real data as seed data. Computer vision, image recognition and robotics are additional applications that are benefiting from the use of synthetic data.
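The source does not describe how that team produced its synthetic share, so the sketch below is only one simple way to reach an 80/20 mix: resample real seed rows and add small random jitter to their numeric features. The seed rows, the 5% noise level and the function name augment_with_synthetic are assumptions made up for this example.

```python
import random

def augment_with_synthetic(seed_rows, synthetic_fraction=0.8, noise=0.05, seed=0):
    """Return (training_rows, n_synthetic) with the requested synthetic share.

    Each synthetic row is a randomly chosen seed row whose numeric features
    are scaled by a small random factor; the label is kept unchanged.
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_real = len(seed_rows)
    n_synthetic = round(n_real * synthetic_fraction / (1 - synthetic_fraction))
    synthetic_rows = []
    for _ in range(n_synthetic):
        features, label = rng.choice(seed_rows)
        jittered = [value * (1 + rng.gauss(0, noise)) for value in features]
        synthetic_rows.append((jittered, label))
    return list(seed_rows) + synthetic_rows, n_synthetic

# Hypothetical real seed data: 25 rows of (features, label), invented for illustration.
seed_rng = random.Random(1)
seed_rows = [
    ([seed_rng.uniform(1, 10), seed_rng.uniform(1, 10)], seed_rng.choice([0, 1]))
    for _ in range(25)
]

training_rows, n_synthetic = augment_with_synthetic(seed_rows)
synthetic_share = n_synthetic / len(training_rows)
```

With 25 seed rows, the function adds 100 jittered rows, so four of every five training rows are synthetic. Jittering only explores the neighborhood of the seed rows, which is why the quality of the real seed data still bounds the quality of the result.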

This was last updated in February 2018
