
Bias in big data: How to find it and mitigate influence

It's no secret that bias exists in large data sets; the key is addressing it. With transparency, diversity and accountability, it is possible to limit that bias.

Bias can occur at different points in the data pipeline. Much of the focus falls on bias that arises during analytics, but bias in big data can be introduced even earlier in the pipeline.

Bias has the potential to enter the data lifecycle as early as collection, according to Kelly Capatosto, senior research associate at the Kirwan Institute for the Study of Race and Ethnicity at The Ohio State University.

"If someone is generating the use of surveys that are subsequently going to inform how a program, model or algorithm operates, [designers'] preconceived notions could end up baked into the process," she said.

There has been growing scrutiny of large data sets and the amount of bias they contain. And while there can be advantages to intentional bias in areas such as target marketing, where a bias in data can provide more direct insight, bias in big data can quickly become an issue for business.

Here are some ways to find and mitigate bias that puts enterprises at a disadvantage.

Where to find it

"[Bias enters data] early in the lifecycle," said Mike Leone, senior analyst at Enterprise Strategy Group (ESG).

This bias can be included unintentionally, even during data collection. As Capatosto said, certain biases may be ingrained into the surveys used to collect the data, but bias can also enter the collection process due to access barriers. One example of this is the recent U.S. Census.

"There are barriers around language, around poverty -- just access to technology -- that make it difficult to meaningfully incorporate that kind of information into any given process," Capatosto said.
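One rough way to spot the kind of collection gap described above is to compare each group's share of the collected data with its share of the population the data is meant to represent. The sketch below is only an illustration: the function name `representation_gaps`, the language categories and all of the figures are hypothetical and are not drawn from the Census or any other real data set.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares):
    """Return (sample share - reference share) for each reference group.

    A negative value means the group is underrepresented in the sample.
    """
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


# Hypothetical survey responses tagged by preferred language (made-up data).
responses = ["english"] * 90 + ["spanish"] * 8 + ["other"] * 2

# Hypothetical shares of the population the survey was meant to reflect.
population = {"english": 0.78, "spanish": 0.14, "other": 0.08}

gaps = representation_gaps(responses, population)
for group, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{group}: {gap:+.2f}")
```

A gap well below zero suggests that a group was harder to reach during collection. It does not say why, so it is a prompt for further review rather than a verdict.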

Another thing to look for is conflation. Capatosto said one of the most prominent issues with algorithmic bias is conflating identity with a level of risk.

"This is really important in the healthcare context [with] contact tracing and how to utilize big data and other measures to build out the public landscape and infrastructure around health concerns," Capatosto said.

She referenced a recent study that found certain communities were alleged to have more prevalent healthcare needs, but the parameter for those needs was money spent on healthcare. Additional research found that cost was a biased parameter because sometimes cost itself is a barrier to accessing healthcare in the first place.

Intentional or unintentional?

While it's important to keep malicious intent out of big data, there are times when it's necessary to include a bias.

"If you have a large data set, you might want to know only about a certain population," said Svetlana Sicular, vice president analyst at Gartner.

Intentional bias is somewhat the point of analytics, according to Leone. When it comes to personalization or reaching target demographics specifically, biased data sets can help achieve those goals.

"Bias enables a truly customized experience for each and every customer," he said.

But using targeted data sets to customize an audience's experience isn't the main source of problems with bias in big data. The bigger issue is unintentional bias.

"Unintentional bias could be done with different purposes, including malicious purposes, like poisoning the data or hacking models," Sicular said.

Sicular pointed to the 2019 launch of the Apple Card as an example of unintentional bias in big data causing problems. Soon after its release, the algorithm was found to set lower credit limits for women. Apple and Goldman Sachs said the discrimination was unintentional, but cases like this, where bias enters without anyone intending it, can be the hardest to address.
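A simple first check for this kind of outcome gap is to compare a summary statistic, such as the median credit limit, across groups. The sketch below is a minimal illustration under stated assumptions: the records, the group labels "A" and "B", and the helper names `median_by_group` and `disparity_ratio` are all hypothetical, and nothing here reflects how Apple or Goldman Sachs actually evaluated their model.

```python
from statistics import median


def median_by_group(records, group_key, value_key):
    """Group records by group_key and return the median of value_key per group."""
    grouped = {}
    for record in records:
        grouped.setdefault(record[group_key], []).append(record[value_key])
    return {group: median(values) for group, values in grouped.items()}


def disparity_ratio(medians):
    """Return the lowest group median divided by the highest (1.0 means parity)."""
    return min(medians.values()) / max(medians.values())


# Made-up applicant records, for illustration only.
applicants = [
    {"group": "A", "credit_limit": 10000},
    {"group": "A", "credit_limit": 12000},
    {"group": "A", "credit_limit": 8000},
    {"group": "B", "credit_limit": 5000},
    {"group": "B", "credit_limit": 6000},
    {"group": "B", "credit_limit": 4000},
]

medians = median_by_group(applicants, "group", "credit_limit")
ratio = disparity_ratio(medians)
print(medians, ratio)
```

A ratio far below 1.0 only flags a difference worth investigating. It does not control for legitimate factors such as income or credit history, so it cannot establish on its own that a model is discriminating.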

"A lot of technologies, a lot of platforms are trying to get explainability, explaining what's happening inside the algorithms," Sicular said.

Building that explainability throughout the data pipeline can mitigate unintentional bias. Capatosto said it starts with literacy and familiarity with concepts of bias in big data. The key is to build transparency and a strong data governance process that works to remove bias in your data sets.

"First and foremost, just ensure that accountability isn't merely an afterthought," she said.

Diversity as a factor

A diverse data team can identify when an intentional bias is necessary, Leone said. A team that includes people with diverse backgrounds will have questions based on their different experiences that can change the approach, Sicular said.

"One single person might not consider certain things," she said.

How a diverse team is being employed matters, though. The goal isn't simply to have a diverse team to calibrate data. Those different perspectives are necessary throughout the pipeline -- from design through implementation, Capatosto said.

Diversity in a data team can play a large part in limiting bias in big data before it's too late, she added.

"I think that having more diversity of opinion, perspective and vantage point is always going to be helpful at identifying those fixes early on," she said.

Enterprise Strategy Group (ESG) is a division of TechTarget.
