Bayesian networks have become popular tools for enterprise data scientists working with prediction, as the rise of cheap and abundant cloud computing has made way for adaptable infrastructure.
In the more distant past, Bayesian networks remained largely conceptual since most developers and businesses lacked the necessary computing power. Now, organizations can use cloud-based infrastructures for several different types of problem solving in parallel as opposed to sharing time on an expensive supercomputer. A powerful probabilistic model, Bayesian networks are seeing rising popularity due to their easily explainable outcomes and ability to correlate variables.
What is a Bayesian network?
As the name implies, Bayesian networks are based on Bayes Theorem of conditional probability, a branch of statistics. A Bayesian network is a graphical model that represents a set of variables and their conditional dependencies using a directed acyclic graph.
Instead of giving a static weight to variables, a Bayesian network provides a probability distribution to create a full picture of the variable's impact on the final result. As such, developers and users can track the metric and variable, as well as its effect in a more straightforward way.
Chris Calvert, cofounder and CTO of security operations center software provider Respond Software, sees the value of Bayesian networks value in their opaqueness.
"With neural networks, you sort of have to take a lot of things on trust whereas with probabilistic graphical modeling-type approaches in Bayesian networks, you can construct a decision that's easily supervised," Calvert said.
The algorithms used in Bayesian networks are capable of inference and learning. In cybersecurity, Calvert said, they are used to make decisions humans have traditionally made -- humans can only consider three to five factors at a time, whereas a Bayesian network can consider every observable factor.
Bayesian networks have been used in nonbusiness settings quite frequently, and enterprises are starting to take note. Using Bayesian networks, data scientists can predict the likelihood that one of several possible causes was the contributing factor to an outcome, which lends itself to a variety of applications for enterprise use.
Cybersecurity researchers use Bayesian reasoning and Bayesian networks to identify malware. For one thing, identifying malware requires an organization to look at all the log files, which is a tedious and boring task ill-suited to humans. Moreover, given the explosion of internet-connected devices, applications, browsers and OSes, humans couldn't keep up if they tried. Calvert said he has even used Bayesian networks to monitor 50 security operations centers simultaneously.
"A human isn't going to know that this IP address [translates] to this hexadecimal because humans don't think in hex," he said. "Machines naturally think in hex."
Bayesian networks are also used to improve the accuracy of microtargeting. For example, if one were to build a digital application for a bank, video game or healthcare company using demographics or personas, that person would make erroneous assumptions about individuals because the audience segmentation is too coarse and not all members of those groups are as identical as they're assumed to be.
Charlie Burgoyne, CEO and founder of data science consulting firm Valkyrie Intelligence, said a more effective approach is to group people by behavior, since a millennial white male and a boomer African American female may act similarly online.
In predictive analytics, this common problem can restrict neural networks and advanced AI's ability to target users accurately.
"A Bayesian network is a great way to trying to predict whether or not I will have a positive user experience that increases engagement based off of different activities that I engage with that have probabilities of success,” Burgoyne said.
Data anonymization technology provider Statice uses Bayesian networks to produce synthetic data. Synthetic data is used for data analytics and machine learning purposes when the use of actual data would be too risky (e.g., healthcare and financial records) or when data is too difficult or costly to collect.
"What I find very powerful is that you can model the data and the certainty of the generative process together," said Jose Pedro Valdes Herrera, data scientist at Statice. "We need good generative models that can create synthetic data [and that] are compatible with whatever privacy preserving mechanisms you need to put in there."
Limitations of Bayesian networks
Bayesian networks require a lot of labeled data to understand relationships. Calvert said the data labeling needs to be crowdsourced so the model can have the data it needs to become "superintelligent."
"Early on, it was difficult for us to get enough security data because it's supersensitive, so we had to show people how this works -- more data, better decisions, more data, better decision," Calvert said. "Before long, [a Bayesian network is] making better decisions than humans have ever been able to make and we're good to go, but there's that window that's really hard to get through."
Another limitation of Bayesian networks applications has to do with conditions and probabilities. For example, as Burgoyne pointed out, each incremental step in a Bayesian network is a superposition of multiple sub-probabilities, one on top of another.
"The likelihood of an egg being rotten isn't a single probability," Burgoyne said. "Many thousand different types of sub-probabilities fit into that [scenario] that are incomplete when summarized as a simple probability for one of the nodes on a graph. So, a limitation is that even though a Bayesian network is considering the complexity of the relationships, it is dependent on a summarization that is often incomplete."
Bayesian networks are just one of many tools data scientists use to make predictions. Given their statistical underpinnings (Bayes Theorem), they're suited to problems involving conditional probabilities.
Many of today's problems involve understanding relationships, which is why organizations have been supplementing their traditional relational and multidimensional databases with graph databases. Similarly, a range of data analysis and machine learning techniques are necessary to answer different types of questions. A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies.