What are oversampling and undersampling?
Oversampling and undersampling are techniques used in data analytics and statistics to modify unequal data classes to create balanced data sets. Oversampling and undersampling are also known as resampling.
These data analysis techniques are often used to make a data set more representative of real-world conditions. For example, data adjustments can be made to provide balanced training data for AI and machine learning algorithms.
Where to use oversampling and undersampling
Oversampling and undersampling intentionally introduce a bias into a data set to make the resulting model more sensitive to a particular group than it would be otherwise.
One area where oversampling and undersampling techniques are used is survey research. A survey sample population may be unbalanced in terms of the types of participants. Oversampling or undersampling can adjust the weight given to surveyed characteristics, such as gender, age group and ethnicity, so the data better reflects those groups' ratios within the greater population.
Another area where oversampling and undersampling are needed is training a model to detect fraudulent or malicious activity. One example is a credit card company that has millions of records of valid activity but only thousands of cases of fraudulent activity. Another example is a network monitoring tool that has billions of data points but needs to find a small sample of malicious activity. To train such models, it may be necessary to produce more examples of the fraudulent activity (oversampling) and remove some of the valid data (undersampling).
Oversampling and undersampling are also useful when it is more important to identify the minority case accurately. For fraudulent or malicious activity, for example, it is better to inaccurately flag good activity as bad than to miss bad activity and let it pass. So, the small amount of bias that might be introduced by these techniques is acceptable if it makes the resulting model more sensitive.
Oversampling vs. undersampling
When one class of data is the underrepresented minority class in the data sample, oversampling techniques may be used to duplicate or synthesize minority entries to create a more balanced number of positive cases in training. Oversampling is used when the amount of data collected is insufficient.
Conversely, if a class of data is the overrepresented majority class, undersampling may be used to balance it with the minority class. Undersampling is used when the amount of collected data is sufficient. Undersampling may also be used when there is too much data to be easily processed, but this is becoming uncommon as processing and storage become cheaper.
For both oversampling and undersampling, naive approaches such as simple data duplication or deletion are rarely suggested. Generally, oversampling is preferable, as undersampling can result in the loss of important data. Undersampling is suggested when the amount of data collected is larger than ideal and can help data mining tools stay within the limits of what they can effectively process.
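As a concrete sketch of this trade-off, the snippet below builds a toy imbalanced data set and balances it in both directions. It assumes the scikit-learn and imbalanced-learn (imblearn) libraries are available; the sample sizes and class weights are illustrative, not prescriptive.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy data set where class 1 is roughly a 5% minority.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=42)
print(Counter(y))  # roughly {0: 9500, 1: 500}

# Oversampling duplicates minority entries up to the majority count.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))

# Undersampling discards majority entries down to the minority count.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))
```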

Oversampling techniques
Random oversampling is the simplest method of oversampling. It simply duplicates some of the entries in the underrepresented class. It is generally not recommended, though, as it can cause the resulting model to overfit the repeated data. Some methods may introduce random noise into the generated samples to reduce this effect.
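A minimal sketch of that duplication step in plain NumPy, with optional jitter; the majority and minority arrays here are illustrative stand-ins for real feature data.

```python
import numpy as np

rng = np.random.default_rng(0)
majority = rng.normal(size=(1000, 3))  # 1,000 majority-class rows
minority = rng.normal(size=(50, 3))    # 50 minority-class rows

# Draw minority rows with replacement until the classes are balanced.
idx = rng.integers(0, len(minority), size=len(majority))
duplicated = minority[idx]

# Optional: add small random noise so the copies are not exact repeats.
jittered = duplicated + rng.normal(scale=0.01, size=duplicated.shape)
```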
Synthetic Minority Oversampling Technique (SMOTE) generates new, unique data based on the existing data. It interpolates between existing minority entries to create new entries that are plausible. For example, if a data set of weights had an entry at 150 pounds and another at 160 pounds, it might create an entry at 155 pounds. The resulting data is more diverse and more likely to be representative of a real population.
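The interpolation idea can be sketched in a few lines using the weight example above. This is only the core step; the full SMOTE algorithm also selects nearest neighbors within the minority class before interpolating.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([150.0])    # one minority entry (weight in pounds)
neighbor = np.array([160.0])  # a nearby minority entry

# The synthetic point lies on the segment between the two entries.
t = rng.random()              # random fraction in [0, 1)
synthetic = sample + t * (neighbor - sample)
print(synthetic)              # [155.0] when t happens to be 0.5
```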
Adaptive synthetic sampling (ADASYN) is an extension of SMOTE. While SMOTE typically focuses on average data in the middle of a set, ADASYN focuses on data at the edges of a data set, which is harder to gather and train for.
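Both techniques are available in the imbalanced-learn library through the same fit_resample interface; the sketch below contrasts them on a toy data set, assuming scikit-learn for the data generation.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, SMOTE

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1],
                           random_state=0)

# SMOTE interpolates between minority neighbors across the class.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_sm))

# ADASYN generates more synthetic points for minority entries in
# hard-to-learn regions near the class boundary.
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y_ad))
```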
Undersampling techniques
Random undersampling removes majority-class entries at random. It is simple to implement but can lose important details in the data set.
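A minimal sketch in plain NumPy; as before, the arrays stand in for real feature data.

```python
import numpy as np

rng = np.random.default_rng(0)
majority = rng.normal(size=(1000, 3))
minority = rng.normal(size=(50, 3))

# Keep a random majority subset equal in size to the minority class.
keep = rng.choice(len(majority), size=len(minority), replace=False)
reduced_majority = majority[keep]
```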
Cluster, or centroid, undersampling takes several entries that are similar or close together and replaces them with a single representative entry, typically the centroid of the cluster.
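One way to sketch this is with scikit-learn's KMeans: cluster the majority class into as many groups as there are minority entries and keep only the cluster centers. imbalanced-learn's ClusterCentroids resampler wraps the same idea.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
majority = rng.normal(size=(1000, 3))
n_minority = 50  # assumed size of the minority class

# Summarize 1,000 majority rows as 50 representative centroids.
kmeans = KMeans(n_clusters=n_minority, n_init=10, random_state=0)
kmeans.fit(majority)
reduced_majority = kmeans.cluster_centers_
```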
Condensed nearest neighbor (CNN) removes data entries that are clearly in one class or the other. This keeps the data points that might help in unclear situations while minimizing the data needed for simple cases.
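imbalanced-learn provides this as CondensedNearestNeighbour (note the British spelling); a minimal usage sketch on a toy data set:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1],
                           random_state=0)

# CNN keeps minority entries and only the majority entries needed to
# still classify the rest correctly with a 1-nearest-neighbor rule.
X_res, y_res = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```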
Tomek links are pairs of entries from different classes that are each other's nearest neighbors; removing the majority entry of each pair helps keep clear boundaries between classes in a data set.
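Tomek links can be found by hand with scikit-learn's NearestNeighbors, as sketched below on made-up data; imbalanced-learn's TomekLinks resampler performs the same removal.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 0.2).astype(int)  # toy labels, 1 is the minority

# For each point, find its nearest neighbor other than itself.
nn = NearestNeighbors(n_neighbors=2).fit(X)
nearest = nn.kneighbors(X, return_distance=False)[:, 1]

# A Tomek link is a mutual nearest-neighbor pair with different labels.
links = [(i, j) for i, j in enumerate(nearest)
         if nearest[j] == i and y[i] != y[j]]

# Drop the majority-class member (class 0 here) of each link.
drop = {i if y[i] == 0 else j for i, j in links}
mask = np.ones(len(X), dtype=bool)
mask[list(drop)] = False
X_clean, y_clean = X[mask], y[mask]
```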
One-sided selection combines CNN and Tomek links to remove excess data from the majority class while maintaining the minority class.
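In imbalanced-learn, this combination is available as OneSidedSelection; a minimal sketch:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1],
                           random_state=0)

# One-sided selection applies the CNN step and Tomek-link removal
# together, trimming only the majority class.
X_res, y_res = OneSidedSelection(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```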
Analytics can be biased, which can hurt profits or lead to social backlash due to discrimination. It's important to fix these biases before problems occur. Explore different types of bias in data analysis and how to avoid them.