Data Dredging (data fishing)
What is data dredging (data fishing)?
Data dredging -- sometimes referred to as data fishing -- is a data mining practice in which large volumes of data are analyzed to find any possible relationships within the data. Data scientists can then form hypotheses about why these relationships exist. In contrast, traditional scientific data analysis methods begin with a hypothesis, followed by an examination of the data to support or refute that hypothesis.
When conducted for unethical purposes, data dredging often circumvents traditional data mining techniques, which may lead to premature conclusions.
More about data dredging
Data dredging is sometimes described as "seeking more information from a data set than it actually contains." The practice is known by other names such as fishing trip, data snooping and p-hacking.
It can be a useful way to find surprising relationships between variables that might not have been discovered otherwise.
However, most data dredging is done improperly, and the misuse is usually unintentional rather than malicious. Often, this is due to a lack of understanding about how data mining techniques should be applied to find previously unknown relationships between different variables. The result is a failure to acknowledge that a discovered correlation was, in fact, coincidental.
Data dredging could lead to an increase in false positive results. This occurs when investigators announce relationships between variables as "significant" when, in fact, the data requires more study before such an association can legitimately be determined and announced. Isolated variables may also need to be contrasted with a control group to make a valid assessment of the relationship between any two variables and to ensure the relationship is not merely coincidental.
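To make the false positive problem concrete, the following minimal sketch screens many variables that are pure random noise and counts how many still look "significant" at the 0.05 level. It is an illustration under stated assumptions, not a method taken from this article: the function names, the sample sizes and the use of a simple z-test with a known standard deviation of 1 are all hypothetical choices made for the example.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def noise_variable_p_value(rng, n):
    """p-value for 'the mean differs from 0' on n draws of pure N(0, 1) noise."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The standard deviation is known to be 1 here, so a z-test applies.
    z = (sum(sample) / n) * math.sqrt(n)
    return two_sided_p_from_z(z)


def count_false_positives(num_variables=1000, n=50, alpha=0.05, seed=0):
    """Count noise variables whose p-value falls at or below alpha."""
    rng = random.Random(seed)
    p_values = [noise_variable_p_value(rng, n) for _ in range(num_variables)]
    return sum(p <= alpha for p in p_values)


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 pure-noise variables look 'significant' at 0.05")
```

Because every variable here is noise by construction, each "significant" result is a false positive. On average, about 5% of the variables will cross the threshold, which is why screening a large number of variables without correction reliably produces apparent findings.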
P-value and its relationship with data dredging
A p-value is the probability, under a specified statistical model, that a statistical summary of the data would be equal to or more extreme than its observed value. By convention, a p-value less than or equal to 0.05 is treated as statistically significant because it indicates that the observed data would be unlikely if the null hypothesis were true.
To establish that they have gathered "significant" data, investigators may pick data that fits their hypothesis or claim and exclude data that doesn't. Whether they do this cherry-picking consciously or unconsciously, the data may show correlations between variables that are simply coincidental or that don't exist. Such conclusions can lead to serious factual distortions and the spread of false information.
Some unethical researchers use data dredging to probe data until they find the analysis that yields the lowest possible p-value. They then report it as a statistically significant result even though it is likely a false positive and therefore unreliable. The good news is that unethical researchers are uncommon, and this type of problem usually arises from a lack of awareness.
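The reason the lowest p-value from many tests is unreliable can be shown with a short calculation. The sketch below assumes the tests are independent and uses the Bonferroni correction as one common safeguard. Both the correction and the function names are introduced here for illustration and are not prescribed by this article.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / num_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values survive the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    print(f"Error rate across 20 tests: {family_wise_error_rate(20):.2f}")
    print(significant_after_bonferroni([0.001, 0.03, 0.04, 0.2]))
```

With 20 independent tests at the 0.05 level, the chance of at least one false positive is roughly 64%. In the usage example, only the p-value of 0.001 stays below the corrected threshold of 0.0125, so the results at 0.03 and 0.04 would no longer be reported as significant.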
Impact of data dredging
Data dredging can have a negative effect on research studies, often without the investigator's knowledge. When done deliberately, it is an unethical practice that can skew the results of studies and trials. It can also mislead the public or anyone who has a stake in the results.
Some common effects of data dredging include the following:
- generation of false positives, affecting the reliability of the results;
- misleading other investigators and affecting the results of their studies;
- increasing the bias of a study;
- loss of crucial resources, especially manpower;
- forcing researchers to retract published studies; and
- possible loss of trial funding.
Data mining vs. data dredging
The terms data mining and data dredging are often used interchangeably, even though they are different concepts. Data dredging usually occurs when data mining is abused.
In data mining, large datasets are analyzed to identify relationships between different variables. Then a hypothesis is formed about why these relationships exist. As more computing power becomes available, data mining has emerged as a useful research tool to analyze larger volumes of data than was previously possible.
In contrast, data dredging typically involves examining a particular data set multiple times to find relationships between variables. Often, these relationships exist by chance and are false positives rather than true results. If precautions are not taken, data dredging can be used in unethical ways to generate results that look genuine but are not reliable.
Preventing data dredging
Sometimes, researchers or investigators rely on flawed practices such as gathering more data after assessing interim results. They may also check how excluding outliers would change their results and only then decide to exclude them. These types of practices introduce bias into their studies and generate unreliable results.
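The effect of collecting more data after checking interim results can be simulated. The following sketch, written under assumed settings (10 interim looks, batches of 10 observations, a z-test on pure noise, and hypothetical function names), compares a study that stops as soon as it sees p <= 0.05 with a study that tests once at a fixed sample size.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p-value for 'mean differs from 0', assuming sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def study_with_peeking(rng, batch=10, max_batches=10, alpha=0.05):
    """Add data in batches and stop at the first look where p <= alpha."""
    sample = []
    for _ in range(max_batches):
        sample.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
        if two_sided_p(sample) <= alpha:
            return True  # declared 'significant' although no effect exists
    return False


def study_fixed_n(rng, n=100, alpha=0.05):
    """Collect all n observations first, then test exactly once."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return two_sided_p(sample) <= alpha


def false_positive_rates(num_studies=2000, seed=1):
    """Return (rate with interim peeking, rate with a fixed sample size)."""
    rng = random.Random(seed)
    peeking = sum(study_with_peeking(rng) for _ in range(num_studies)) / num_studies
    fixed = sum(study_fixed_n(rng) for _ in range(num_studies)) / num_studies
    return peeking, fixed


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"False positive rate with peeking: {peeking:.3f}")
    print(f"False positive rate with fixed n: {fixed:.3f}")
```

Since the simulated data contains no real effect, every "significant" study is a false positive. The fixed-sample design should stay near the nominal 5% rate, while the design that peeks at interim results and stops early should show a noticeably higher rate, even though both end with at most the same amount of data.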
To prevent these flawed practices, robust rules on when to stop data collection and how to analyze data to detect meaningful effects should be encouraged. Authors should also follow rules on proper handling of outliers, expected transformations of variables and which covariates to control for.
Also, published studies should list all the variables collected in the study. They should specify all planned analyses, including primary and secondary outcomes. It is also advisable to include robustness analyses for the methodological choices made by the investigators during the study.
A broader intervention is to update the standards for conceiving, conducting and publishing scientific research.