Data Dredging (data fishing)

By Rahul Awati

Published: Apr 14, 2022

What is data dredging (data fishing)?

Data dredging -- sometimes referred to as data fishing -- is a data mining practice in which large volumes of data are analyzed to find any possible relationships between variables. Data scientists can then form hypotheses about why these relationships exist. In contrast, traditional scientific data analysis methods begin with a hypothesis, followed by an examination of the data to support or refute that hypothesis.

When conducted for unethical purposes, data dredging often circumvents traditional data mining techniques, which may lead to premature conclusions.

More about data dredging

Data dredging is sometimes described as "seeking more information from a data set than it actually contains." The practice is known by other names such as fishing trip, data snooping and p-hacking.

It can be a useful way to find surprising relationships between variables that might not have been discovered otherwise.

However, data dredging is often applied improperly, and the misuse is usually unintentional rather than malicious. Often, it stems from a lack of understanding about how data mining techniques should be applied to find previously unknown relationships between variables. The result is a failure to acknowledge that a discovered correlation was, in fact, coincidental.

Data dredging can increase the number of false positive results. This occurs when investigators announce relationships between variables as "significant" when, in fact, the data requires more study before such an association can legitimately be determined and announced. Isolated variables may also need to be contrasted with a control group to make a valid assessment of the relationship between any two variables and to ensure the relationship is not merely coincidental.
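The false positive problem can be illustrated with a small simulation. The sketch below is an illustration only, not a method from this article: it generates an outcome and 200 candidate predictors that are all pure random noise, then counts how many predictors appear "significantly" correlated with the outcome at the 0.05 level. The function names, sample sizes and seed are assumptions chosen for the example, and the p-value uses the Fisher z-transform approximation rather than an exact test.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for 'no correlation' (Fisher z-transform)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def dredge_noise(n_samples=50, n_predictors=200, alpha=0.05, seed=0):
    """Count noise predictors that look 'significant' against a noise outcome."""
    rng = random.Random(seed)  # hypothetical seed, fixed only for repeatability
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson_r(predictor, outcome)
        if correlation_p_value(r, n_samples) <= alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    print("Spurious 'significant' correlations:", dredge_noise())
```

Because every variable here is noise, any predictor that passes the threshold is a false positive. With 200 tests at a 0.05 threshold, roughly 10 such coincidental "findings" are expected on average.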

[Figure: The four stages of data mining]

P-value and its relationship with data dredging

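A p-value is the probability that a statistical summary of data would be equal to or more extreme than its observed value under a specified statistical model. By convention, a p-value less than or equal to 0.05 is commonly treated as statistically significant, on the grounds that data this extreme would be unlikely if the null hypothesis were true. The threshold is a convention rather than proof of a real effect: when many relationships are tested, some will fall below 0.05 by chance alone.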
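As a concrete instance of that definition, the following minimal sketch computes a one-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips, tested against a fair coin) and the function name are assumptions made for illustration. The "specified statistical model" is a binomial distribution with a success probability of 0.5, and "equal to or more extreme" means 60 or more heads.

```python
import math


def binomial_tail_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        math.comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical observation: 60 heads in 100 flips of a supposedly fair coin.
    p = binomial_tail_p_value(60, 100)
    print(f"P(at least 60 heads | fair coin) = {p:.4f}")
```

For this example, the p-value comes out at roughly 0.03, which falls below the conventional 0.05 threshold. That alone does not show the coin is biased. It only says that a result this extreme would be uncommon if the coin were fair.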

To establish that they have gathered "significant" data, investigators may select data that fits their hypothesis or claim and exclude data that doesn't. Whether they do this cherry-picking consciously or unconsciously, the data may show correlations between variables that are simply coincidental or that don't exist. Such conclusions can lead to serious factual distortions and the spread of false information.

Some unethical researchers dredge data until they find the analysis that yields the lowest possible p-value. They then report that result as statistically significant even though it is a false positive and therefore unreliable. The good news is that unethical researchers are uncommon, and this type of problem usually arises from a lack of awareness.
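The cost of reporting only the lowest p-value can be estimated with a short sketch. It assumes independent tests, for which the chance of at least one false positive among m tests at threshold alpha is 1 - (1 - alpha)^m, and it checks that figure with a simulation in which no real effect exists. The function names, the number of tests and the simple z-test with a known standard deviation of 1 are assumptions made for illustration.

```python
import math
import random


def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def z_test_p_value(sample):
    """Two-sided p-value for 'true mean is 0', assuming a known sd of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def lowest_p_false_positive_rate(
    n_studies=1000, n_tests=20, n_obs=20, alpha=0.05, seed=1
):
    """Fraction of all-noise studies whose lowest p-value passes alpha."""
    rng = random.Random(seed)  # hypothetical seed, fixed only for repeatability
    false_positives = 0
    for _ in range(n_studies):
        p_values = [
            z_test_p_value([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_tests)
        ]
        if min(p_values) <= alpha:  # keep only the "best" result
            false_positives += 1
    return false_positives / n_studies


if __name__ == "__main__":
    print("Analytic rate: ", round(family_wise_error_rate(0.05, 20), 4))
    print("Simulated rate:", lowest_p_false_positive_rate())
```

With 20 independent tests and a 0.05 threshold, the analytic rate is about 64%. In other words, a study that runs 20 tests on noise and reports only the best one is more likely than not to produce a "significant" finding.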

Impact of data dredging

Data dredging can have a negative effect on research studies, often without the investigator's knowledge. When done deliberately, it is an unethical practice that can skew the results of studies and trials. It can also mislead the public or anyone who has a stake in the results.

Some common effects of data dredging include the following:

generating false positives, which undermines the reliability of the results;
misleading other investigators and affecting the results of their studies;
increasing the bias of a study;
wasting crucial resources, especially staff time;
forcing researchers to retract published studies; and
risking the loss of trial funding.

Data mining vs. data dredging

The terms data mining and data dredging are often used interchangeably, even though they are different concepts. Data dredging usually occurs when data mining is abused.

In data mining, large data sets are examined to identify relationships between different variables. Then a hypothesis is formed about why these relationships exist. As more computing power becomes available, data mining has emerged as a useful research tool to analyze larger volumes of data than was previously possible.

In contrast, data dredging typically involves examining a particular data set multiple times to find relationships between variables. Often, these relationships exist by chance and are false positives rather than true results. If precautions are not taken, data dredging can be used in unethical ways to generate results that look genuine but are not reliable.

Preventing data dredging

Sometimes, researchers or investigators rely on flawed practices such as gathering more data after assessing interim results. They may also check how excluding outliers changes their results and then decide whether to exclude them based on what they see. These types of practices introduce bias into their studies and generate unreliable results.
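The effect of gathering more data after assessing interim results can be shown with a simulation. The sketch below is an illustration under stated assumptions, not a procedure from this article: every experiment draws pure noise, so any "significant" result is a false positive. It compares a researcher who tests after every batch and counts a success the first time the p-value drops below 0.05 with one who tests once at a fixed, pre-planned sample size. The batch sizes, function names and the z-test with a known standard deviation of 1 are all hypothetical choices.

```python
import math
import random


def z_test_p_value(total, n):
    """Two-sided p-value for 'true mean is 0', assuming a known sd of 1."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_stopping(
    n_experiments=2000, start_n=20, step=10, max_n=100, alpha=0.05, seed=2
):
    """Return (peeking_rate, fixed_rate) of false positives on pure noise."""
    rng = random.Random(seed)  # hypothetical seed, fixed only for repeatability
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        total = 0.0
        n = 0
        significant_at_any_look = False
        while n < max_n:
            batch = start_n if n == 0 else step
            for _ in range(batch):
                total += rng.gauss(0, 1)
            n += batch
            # Interim look: a peeking researcher would stop and report here.
            if z_test_p_value(total, n) <= alpha:
                significant_at_any_look = True
        if significant_at_any_look:
            peeking_hits += 1
        # Fixed design: a single test at the final sample size only.
        if z_test_p_value(total, n) <= alpha:
            fixed_hits += 1
    return peeking_hits / n_experiments, fixed_hits / n_experiments


if __name__ == "__main__":
    peeking, fixed = simulate_stopping()
    print(f"False positive rate with interim looks: {peeking:.3f}")
    print(f"False positive rate with one fixed test: {fixed:.3f}")
```

The fixed design should stay close to the nominal 5% false positive rate, while the design with repeated looks lands well above it, because each extra look is another chance for noise to cross the threshold.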

To prevent these flawed practices, robust rules on when to stop data collection and how to analyze data to detect meaningful effects should be encouraged. Authors should also follow rules on proper handling of outliers, expected transformations of variables and which covariates to control for.

Also, published studies should list all the variables collected in the study. They should specify all planned analyses, including primary and secondary outcomes. It is also advisable to include robustness analyses for the methodological choices made by the investigators during the study.

A broader intervention is to update the standards for conceiving, conducting and publishing scientific research.
