It’s commonly assumed that one of the great advantages of big data is its impact on statistical analysis: The thinking is that any statistical test is improved as the sample size grows. Flipping a coin twice and getting heads twice doesn’t mean much. But if you flip a coin 100 times and it comes up heads every time — that means a lot.
But not all big data is better data. That was the message to data enthusiasts from eminent statistician Xiao-Li Meng, the Whipple V. N. Jones Professor of Statistics at Harvard University, at a recent seminar on the Cambridge, Mass., campus.
“When you take into account the quality of the data, sometimes your seemingly very large data set becomes tiny,” Meng said. Call it the big data paradox.
Case in point, and the topic of a recent paper by Meng on this big data paradox, is the 2016 U.S. presidential election. “Any way you look at it, the election of Hillary Clinton looked like a foregone conclusion,” Meng said, noting the “greater than 90%” win probabilities bandied about in the days before the election. “And we all know what happened.”
The utter failure to predict the outcome was painful for statisticians, and an opportunity to figure out what went wrong. Armed with election data provided by a Harvard colleague, Meng said he could now offer at least one of perhaps many explanations for the prediction failure.
Homogeneity in large data sets
Here is Meng’s non-math explanation of the big data paradox for people like me in the audience:
Let’s say, for example, you want to conduct a survey in China using “n” people and the same survey in the United States using “m” people. The population of China is about four times that of the United States. You want both surveys to be equally accurate. So what should the ratio n/m be? Should it be four, or two, or one?
“The correct answer (as with most things in life) is it depends,” Meng said.
That’s because, according to Meng, it doesn’t matter how large the population is; what matters is the sample size and, here’s the clincher, the quality of the sample. Meng suggested we think of sampling as tasting a soup: if you are trying to determine how salty or how delicious a soup is, then no matter how large the pot, as long as the contents are mixed well, you only need a few spoonfuls.
“So the whole idea in statistical inference that only the sample size matters is based on the assumption that the sample has been mixed well,” Meng said.
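To see this concretely, here is a small Python simulation of the soup analogy (an illustration of my own, with invented numbers, not from Meng’s talk): two “pots” of very different sizes, each containing the same true proportion of a trait. A well-mixed (simple random) sample of the same size estimates that proportion about equally well in both, no matter how large the pot.

```python
import random

random.seed(42)

def mean_abs_error(N, n, true_p=0.5, trials=500):
    """Average absolute error of the sample proportion over many
    simple random ("well-mixed") samples of size n from a population of size N."""
    population = [1] * int(N * true_p) + [0] * (N - int(N * true_p))
    total = 0.0
    for _ in range(trials):
        sample = random.sample(population, n)
        total += abs(sum(sample) / n - true_p)
    return total / trials

# Same spoonful (n = 400) from pots differing in size by a factor of 100:
err_small_pot = mean_abs_error(N=10_000, n=400)
err_large_pot = mean_abs_error(N=1_000_000, n=400)

print(f"pot of 10,000:    mean error {err_small_pot:.4f}")
print(f"pot of 1,000,000: mean error {err_large_pot:.4f}")
```

The two errors come out essentially identical, at roughly 0.02: the spoonful, not the pot, sets the accuracy.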
Big data paradox exacerbates bias for Clinton
Mixing well is much harder to do for large populations than for small ones, and the 2016 voting surveys showed just how poorly the mixing was done, Meng said.
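For readers who want the mechanism in symbols, Meng’s paper (cited at the end of this piece) decomposes the error of a sample average into three factors; roughly:

```latex
\bar{G}_n - \bar{G}_N
  \;=\;
  \underbrace{\hat{\rho}_{R,G}}_{\text{data quality}}
  \;\times\;
  \underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}}
  \;\times\;
  \underbrace{\sigma_G}_{\text{problem difficulty}}
```

Here \(\bar{G}_n\) is the sample average, \(\bar{G}_N\) the population average, and \(\hat{\rho}_{R,G}\) is what Meng calls the data defect correlation: the correlation between whether someone responds (R) and what they would answer (G). In a well-mixed sample that correlation shrinks as the population grows, canceling the middle factor. But in a self-selected sample it stays roughly constant, so for a fixed sampling fraction a larger population makes the error worse, not better. (This is a sketch of the identity; see the paper for the exact conditions.)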
Why? The surveys failed to account for the people who refused to answer the question “Who are you planning to vote for?”
People who wanted to vote for Trump were somewhat less likely to answer the question than Clinton supporters, perhaps because they suspected “it was not a popular answer,” Meng said. “They are the shy ones, and if you didn’t take that into account, your survey would show that people are overwhelmingly voting for Clinton.”
Indeed, people who purposely do not respond destroy the statistician’s fair coin flip. Not accounting for those who declined to answer was a fatal mistake in the 2016 survey analyses, big data notwithstanding.
In fact, big data made the error worse, masking the population of people who were planning to vote for Trump but did not want to say it, Meng explained.
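A toy simulation makes the point (the numbers here are invented for illustration and are not Meng’s estimates): in a population split dead even between two candidates, suppose one side’s supporters are just slightly less likely to answer a poll. The resulting “big” self-selected sample is confidently wrong, while a far smaller well-mixed sample has no systematic bias at all.

```python
import random

random.seed(1)

N = 1_000_000
# 1 = would say Clinton, 0 = would say Trump; an exactly even race by construction.
population = [1] * (N // 2) + [0] * (N // 2)

# Self-selected "big data" poll: Clinton supporters answer with probability
# 0.55, Trump supporters (the "shy" ones) with probability 0.50.
respondents = [v for v in population
               if random.random() < (0.55 if v == 1 else 0.50)]
big_n = len(respondents)
big_estimate = sum(respondents) / big_n

# Well-mixed alternative: only 1,000 people, chosen uniformly at random,
# all of whom answer.
small_sample = random.sample(population, 1_000)
small_estimate = sum(small_sample) / len(small_sample)

print(f"true Clinton share:              0.500")
print(f"self-selected poll (n={big_n:,}): {big_estimate:.3f}")
print(f"well-mixed poll (n=1,000):       {small_estimate:.3f}")
```

The half-million-response poll lands near 0.524 every time: its enormous n shrinks the random error to almost nothing, which only makes the systematic bias look more trustworthy. The 1,000-person random poll is noisier but centered on the truth.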
Bottom line on the big data paradox: what matters most in data analysis is the quality, not the quantity, of the data. Missing this truth can lead one astray in almost any big data study.
For the mathematicians among you, here is Meng’s paper: “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election,” Annals of Applied Statistics, Vol. 12, No. 2, pp. 685–726.