This is the first piece in a three-part series.
There is a lot of controversy in business circles as to whether organizations use AI technology for unethical purposes. This blog isn’t about that at all.
From a data scientist’s point of view, ethical AI is achieved by taking precautions to expose what the underlying machine learning model might learn that could impute bias. At first glance, latent features of the model or relationships between data may not appear to be biased, but deeper inspection shows that the analytic results the model produces are biased toward a particular data class.
Bias can be imputed through confounding variables
One of the most common misperceptions I hear about bias is “If I don’t use age, gender or race, or similar factors in my model, it’s not biased.” Well, that’s not true. Even though the people holding this opinion know that machine learning can learn relationships between data, they don’t recognize that other captured features can act as proxies for those sensitive data types. These proxies are called confounding variables and, as the term indicates, unintended variables can confuse the model into producing biased results.
For example, if a model includes the brand and version of an individual’s mobile phone, that data can reflect the ability to afford an expensive phone, a characteristic that can impute income. If income is not a factor you want to use directly in the decision, imputing it from data such as the type of phone or the size of the individual’s purchases introduces bias into the model. A high dollar amount on purchases can indicate that an individual is more apt to make these types of transactions over time, again imputing income bias.
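One rough way to check whether a feature acts as a proxy is to ask how much better you can guess the sensitive attribute once you know the feature. The sketch below is only an illustration: the records, the `phone_tier` and income labels, and the `proxy_strength` helper are all hypothetical, not taken from any real data set or library.

```python
from collections import defaultdict

# Hypothetical records: (phone_tier, income_bracket). Values are invented.
records = [
    ("premium", "high"), ("premium", "high"), ("premium", "high"), ("premium", "low"),
    ("budget", "low"), ("budget", "low"), ("budget", "low"), ("budget", "high"),
]

def proxy_strength(pairs):
    """Compare two ways of guessing the sensitive value.

    baseline: always guess the most common sensitive value overall.
    informed: guess the most common sensitive value for each feature value.
    A large gap suggests the feature carries information about the
    sensitive attribute and may act as a proxy for it.
    """
    by_feature = defaultdict(lambda: defaultdict(int))
    overall = defaultdict(int)
    for feature, sensitive in pairs:
        by_feature[feature][sensitive] += 1
        overall[sensitive] += 1
    n = len(pairs)
    baseline = max(overall.values()) / n
    informed = sum(max(counts.values()) for counts in by_feature.values()) / n
    return baseline, informed

baseline, informed = proxy_strength(records)
print(f"guess without phone tier: {baseline:.2f}, with phone tier: {informed:.2f}")
```

In this toy data, knowing the phone tier raises the accuracy of an income guess well above the baseline, which is the kind of signal that would warrant a closer look before the feature goes into a model.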
Research into the effects of smoking provides another example of confounding variables. In decades past, research essentially reasoned that if you smoke, your probability of dying in the next four years is fairly low, so smoking must be OK. The confounding variable in this reasoning was the age distribution of smokers. In the past, the smoking population contained many younger smokers whose cancer would develop later in life. The older smokers were already deceased. Thus, the analytic model contained overwhelming bias and created a biased perception about the safety of smoking.
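The smoking example can be made concrete with a small calculation. The counts below are invented purely for illustration (they are not from any study), and `death_rate` is a hypothetical helper. The point is only to show how a pooled comparison can point one way while the comparison within each age group points the other.

```python
# Invented counts, for illustration only:
# (age_group, smokes) -> (people, deaths within four years)
cohort = {
    ("young", True): (900, 18),
    ("young", False): (300, 3),
    ("old", True): (100, 20),
    ("old", False): (700, 70),
}

def death_rate(smokes, age_group=None):
    """Death rate for smokers or non-smokers, optionally within one age group."""
    people = deaths = 0
    for (age, is_smoker), (n, d) in cohort.items():
        if is_smoker == smokes and (age_group is None or age == age_group):
            people += n
            deaths += d
    return deaths / people

# Pooled view: ignores age, so the mostly young smokers look safer.
print("pooled:", death_rate(True), "vs", death_rate(False))

# Stratified view: within each age group, smokers fare worse.
for age in ("young", "old"):
    print(age, ":", death_rate(True, age), "vs", death_rate(False, age))
```

Because most smokers in this made-up cohort are young, the pooled rate for smokers comes out lower than for non-smokers, even though smokers have the higher rate in both age groups. Splitting by the suspected confounder is what exposes the problem.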
In the 21st century, similar bias could be produced by a model concluding that, since far fewer young people smoke cigarettes than 50 years ago, nicotine addiction levels are down, too. However, youth use of e-cigarettes jumped 78% between 2017 and 2018, to one out of every five high-school students. E-cigarettes are potent nicotine delivery devices, fostering rapid nicotine addiction.
The challenge of delivering truly ethical AI requires closely examining each data class separately. As data scientists, we must demonstrate to ourselves and the world that AI and machine learning technologies are not subjecting specific populations to bias.
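As a minimal sketch of what examining each data class separately might look like, the snippet below compares a model’s positive-outcome rate across classes. The decisions, the class labels "A" and "B", and the `rate_by_class` helper are hypothetical, and a gap in rates is a prompt for further investigation rather than proof of bias on its own.

```python
from collections import defaultdict

# Hypothetical model decisions: (data_class, approved). Values are invented.
decisions = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

def rate_by_class(rows):
    """Return the share of positive outcomes for each data class."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for data_class, outcome in rows:
        totals[data_class] += 1
        positives[data_class] += int(outcome)
    return {c: positives[c] / totals[c] for c in totals}

rates = rate_by_class(decisions)
gap = max(rates.values()) - min(rates.values())
print(rates, "gap:", gap)
```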
All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.