Today I’ll discuss a powerful technique from artificial intelligence (AI) that you are probably not using and may not have even heard of. This is partly because it’s typically considered an advanced topic. I’ve never understood this because the topic is not that difficult and is so powerful that I think all data scientists should be aware of its existence. I’ll introduce and list some advantages of *Bayesian networks* in the hope this will make you aware of the technique and motivate you to learn more on your own.

## The Need for Bayesian Networks

Although there is a wealth of data in today’s world, this information is usually insufficient for business needs. It’s not that there isn’t enough data; rather, the problem is that we need information about quantities that cannot be measured directly. We must use reasoning to take whatever data is available and translate it into evidence about these hidden quantities. The complexity of today’s business problems and the sheer volume of this data demand that this reasoning process be automated. We can then develop an estimate of the “world state” that generates this data (see my Quora answer about Data Fusion and Big Data for more details).

Early work in AI used logical and rule-based production systems to attempt to represent uncertainty. This was partially because researchers considered the ideal of using probability calculus to be unattainable at that time. These early approaches suffered from several shortcomings.

One of the more notable problems was that rule-based approaches could only make inferences in one direction. Systems designed to support *diagnostic reasoning* (i.e., from evidence to hypotheses) were unable to reverse direction and perform *predictive inference* (e.g., determining the likely symptoms of a given disease). This meant that these systems were incapable of *intercausal reasoning*, which requires bi-directional inference. With intercausal reasoning, we can increase the belief in one hypothesis explaining an observation by obtaining evidence ruling out a competing hypothesis.

The development of Bayesian networks (BNs) in the 1980s provided a mechanism for AI systems to perform bi-directional inference efficiently and with a rigorous probabilistic foundation. They are now a standard method for reasoning under uncertainty.

## What are Bayesian Networks?

BNs are directed acyclic graphs that specify dependencies between random variables in a system. Nodes of the graph correspond to the random variables. If an arc connects two nodes, then those variables directly depend on one another. The topology of the graph encodes any conditional independencies between the variables, and this induces a factorization of the overall joint probability distribution. This factorization, combined with Bayes’s rule, allows for computationally efficient belief propagation algorithms. When we receive a new piece of evidence regarding one of the random variables in the network, these belief propagation algorithms allow us to update the probabilities of all the other variables, including through intercausal reasoning.

The direction of the arcs typically indicates the direction of causality. Nodes on the origin side of an arc are called “parents” and those on the terminal end are “children.” Conditional probability distributions (CPDs) specify how the value of a given random variable is influenced by the values of its parent nodes. In the case where the parents and children are discrete random variables, all possibilities can be enumerated and the CPD becomes a conditional probability table (CPT).
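To make the idea of a CPT concrete, here is a minimal sketch in Python. It assumes a hypothetical binary child variable `WET` with two binary parents, `SPRINKLER` and `RAIN`; the names and all of the probabilities are made up for illustration and are not taken from any real model.

```python
from itertools import product

# Hypothetical CPT for a binary child WET with binary parents SPRINKLER and RAIN.
# Keys are (sprinkler, rain); values are P(WET=True | sprinkler, rain).
# All numbers are illustrative assumptions.
P_WET_TRUE = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.01,
}


def p_wet(wet, sprinkler, rain):
    """Return P(WET=wet | SPRINKLER=sprinkler, RAIN=rain)."""
    p_true = P_WET_TRUE[(sprinkler, rain)]
    return p_true if wet else 1.0 - p_true


# Each row of the table is a distribution over the child, so it must sum to 1.
for sprinkler, rain in product([True, False], repeat=2):
    row_sum = p_wet(True, sprinkler, rain) + p_wet(False, sprinkler, rain)
    assert abs(row_sum - 1.0) < 1e-12
```

The table has one row per combination of parent values, which is why a CPT stays small as long as each node has only a few parents.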

## A Simple Example

Consider a very simple BN which could be used for modeling the condition of pavement outside. The topology of the network indicates the relationships between the *SEASON* of the year, whether it is currently *RAINING*, the status of the *SPRINKLER* system, and whether the pavement is *WET* and *SLIPPERY*. While the season influences the probability that the pavement is slippery, it does so indirectly through the wetness of the pavement.

Note, if we used this network to model the condition of pavement in a state in which ice forms during the winter, we would need to include an additional arc directly from *SEASON* to *SLIPPERY* to represent the possibility of ice in *winter* making the pavement slippery without being wet. The absence of this arc is an example of using contextual information to simplify the network.

If we know the values of some of the variables, we can infer the rest using Bayes’s rule and the CPDs. One unique property of BNs is that we can reason in any direction we wish, not just the one indicated by the direction of the arcs. If the sprinkler is on, we can deduce that the pavement is probably wet. If we see someone slip, we infer that the likely cause is that either the sprinkler is on or it is raining. If we see someone slip and we know that the sprinkler is on, then we can employ intercausal reasoning to explain away the hypothesis that it is raining.
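The three kinds of reasoning described here can be sketched with brute-force enumeration over the joint distribution. This is only an illustration, not an efficient inference algorithm. The network structure (season influences sprinkler and rain, both influence wetness, wetness influences slipperiness) and every probability below are my assumptions, chosen just to show the qualitative effect.

```python
from itertools import product

# Assumed structure: SEASON -> SPRINKLER, SEASON -> RAIN,
# SPRINKLER and RAIN -> WET, WET -> SLIPPERY. All numbers are made up.
SEASONS = ["winter", "spring", "summer", "fall"]
P_SEASON = {s: 0.25 for s in SEASONS}
P_SPRINKLER = {"winter": 0.05, "spring": 0.3, "summer": 0.6, "fall": 0.2}
P_RAIN = {"winter": 0.5, "spring": 0.4, "summer": 0.1, "fall": 0.3}
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.01}
P_SLIPPERY = {True: 0.7, False: 0.02}
NAMES = ("season", "sprinkler", "rain", "wet", "slippery")


def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true


def joint(season, sprinkler, rain, wet, slippery):
    """Joint probability of one complete world state, as a product of CPDs."""
    return (P_SEASON[season]
            * bern(P_SPRINKLER[season], sprinkler)
            * bern(P_RAIN[season], rain)
            * bern(P_WET[(sprinkler, rain)], wet)
            * bern(P_SLIPPERY[wet], slippery))


def query(target, evidence):
    """P(target=True | evidence), summing the joint over all consistent worlds."""
    num = den = 0.0
    for values in product(SEASONS, *[[True, False]] * 4):
        world = dict(zip(NAMES, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(*values)
        den += p
        if world[target]:
            num += p
    return num / den


prior = query("rain", {})                                        # no evidence
after_slip = query("rain", {"slippery": True})                   # diagnostic
explained = query("rain", {"slippery": True, "sprinkler": True})  # intercausal
print(prior, after_slip, explained)
```

With these assumed numbers, seeing someone slip raises the belief in rain above its prior, and additionally learning that the sprinkler is on lowers it again, which is the explaining-away effect.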

This is a very simple example. A more sophisticated network could replace the *RAINING* binary variable with one that can attain states *NoRain*, *LightSprinkle*, *Moderate*, and *TorrentialDownpour*. It may also include additional nodes to account for other effects, like the proverbial banana peel on the pavement causing someone to slip.

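BNs provide a compact graphical representation of a potentially complex world state. The world state is a joint distribution over events and attributes. The number of possible world states grows exponentially with the number of variables, but BNs factor the joint distribution into local conditional distributions, which simplifies things significantly. In this example, assuming that *SEASON* influences both *SPRINKLER* and *RAINING*, that these two jointly influence *WET*, and that *WET* influences *SLIPPERY*, the joint distribution can be written as

$$
P(\text{SEASON}, \text{SPRINKLER}, \text{RAINING}, \text{WET}, \text{SLIPPERY}) = P(\text{SEASON})\, P(\text{SPRINKLER} \mid \text{SEASON})\, P(\text{RAINING} \mid \text{SEASON})\, P(\text{WET} \mid \text{SPRINKLER}, \text{RAINING})\, P(\text{SLIPPERY} \mid \text{WET})
$$

using the independence relations encoded in the graphical structure. As long as each node has only a small, bounded number of parents, the number of parameters required by a Bayesian network model grows linearly with the size of the network instead of exponentially.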
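To put rough numbers on the savings, the sketch below counts free parameters for a full joint table and for a factored network. The structure and the number of states per variable are my assumptions for the pavement example (four seasons, everything else binary), not something fixed by the technique.

```python
from math import prod

# Assumed number of states per variable and assumed parent sets.
CARD = {"season": 4, "sprinkler": 2, "rain": 2, "wet": 2, "slippery": 2}
PARENTS = {
    "season": [],
    "sprinkler": ["season"],
    "rain": ["season"],
    "wet": ["sprinkler", "rain"],
    "slippery": ["wet"],
}


def full_joint_params(card):
    """Free parameters in an unfactored joint table (entries must sum to 1)."""
    return prod(card.values()) - 1


def factored_params(card, parents):
    """Free parameters when each node stores a CPT over its parents only."""
    total = 0
    for node, pa in parents.items():
        rows = prod(card[p] for p in pa)   # one row per parent configuration
        total += rows * (card[node] - 1)   # each row sums to 1
    return total


print(full_joint_params(CARD), factored_params(CARD, PARENTS))
```

Under these assumptions the full joint table needs 63 free parameters while the factored form needs 17, and the gap widens quickly as more variables are added.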

## More Advantages of Bayesian Networks

So far, we’ve talked about the fact that BNs can perform several types of reasoning: diagnostic (aka abductive, bottom-up), predictive (aka causal, deductive, top-down), and intercausal (aka explaining away). This is a major benefit, since humans do this easily and yet it has traditionally been very difficult for computers to do.

We’ve also stated that updating the probabilities of each of the variables in the network can be done through efficient message-passing algorithms. This is true of networks that have a tree-like structure (i.e., no “loops” in the graph). Networks that do possess loops can usually be transformed into polytrees, where the message-passing procedures are still applicable. Many studies have also shown that message passing often produces acceptable results when applied directly to networks with loops, even though the results are then only approximate. Message-passing algorithms also lend themselves well to multi-processor architectures.
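As a rough illustration of message passing, the sketch below handles only the simplest tree-like case: a chain of binary variables X1 → X2 → X3 → X4 with evidence on the last node. Messages are passed backward along the chain to update the belief in X1, and the result is checked against brute-force enumeration. The chain, the variable names, and all of the probabilities are assumptions made up for this example.

```python
from itertools import product

# Chain X1 -> X2 -> X3 -> X4, all binary (state 0 or 1). Numbers are made up.
PRIOR = [0.7, 0.3]  # P(X1)
# TRANSITIONS[k][i][j] = P(child=j | parent=i) for the k-th arc in the chain.
TRANSITIONS = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.1, 0.9]],
]


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def posterior_first_by_messages(last_value):
    """P(X1 | X_last=last_value) using backward messages, one per arc."""
    # The evidence node sends an indicator on its observed state.
    msg = [1.0 if s == last_value else 0.0 for s in range(2)]
    for table in reversed(TRANSITIONS):
        # Each node sums out its child and forwards the result to its parent.
        msg = [sum(table[i][j] * msg[j] for j in range(2)) for i in range(2)]
    return normalize([PRIOR[i] * msg[i] for i in range(2)])


def posterior_first_by_enumeration(last_value):
    """Same quantity by summing over every joint state (exponential cost)."""
    n = len(TRANSITIONS) + 1
    unnorm = [0.0, 0.0]
    for states in product(range(2), repeat=n):
        if states[-1] != last_value:
            continue
        p = PRIOR[states[0]]
        for k, table in enumerate(TRANSITIONS):
            p *= table[states[k]][states[k + 1]]
        unnorm[states[0]] += p
    return normalize(unnorm)


print(posterior_first_by_messages(1), posterior_first_by_enumeration(1))
```

The message-passing version does a fixed amount of work per arc, so its cost grows linearly with the length of the chain, whereas the enumeration doubles with every added variable.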

The compact representation of joint probability distributions reduces the number of parameters necessary for specifying a complex world state. As with most data science modeling techniques, reducing the number of parameters helps protect against overfitting.

In contrast to many powerful models, such as neural networks, Bayesian networks are not *black boxes*: their graphical structure makes them far easier for humans to understand. This is one of the more important advantages of Bayesian networks. There are two benefits to this.

First, humans can play a key role in the construction of the network. In practice, a subject matter expert works in conjunction with someone experienced in designing these networks to specify the random variables and the topology of the network. With the structure mostly determined, the data can then simply be used for learning the CPDs.
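One common way to learn a CPT once the structure is fixed is simply to count how often each child value occurs for each combination of parent values (a maximum-likelihood estimate). The sketch below assumes that approach; the records, the variable names, and the helper `estimate_cpt` are all hypothetical.

```python
from collections import Counter

# Hypothetical observations; in practice these would come from real data.
RECORDS = [
    {"sprinkler": True, "rain": False, "wet": True},
    {"sprinkler": True, "rain": False, "wet": True},
    {"sprinkler": True, "rain": False, "wet": False},
    {"sprinkler": True, "rain": False, "wet": True},
    {"sprinkler": False, "rain": True, "wet": True},
    {"sprinkler": False, "rain": True, "wet": True},
    {"sprinkler": False, "rain": False, "wet": False},
    {"sprinkler": False, "rain": False, "wet": False},
]


def estimate_cpt(records, child, parents):
    """Estimate P(child=True | parents) by counting, with no smoothing."""
    totals = Counter()
    trues = Counter()
    for record in records:
        key = tuple(record[p] for p in parents)
        totals[key] += 1
        if record[child]:
            trues[key] += 1
    # Parent combinations never observed in the data are simply absent.
    return {key: trues[key] / totals[key] for key in totals}


cpt = estimate_cpt(RECORDS, "wet", ["sprinkler", "rain"])
print(cpt)
```

With so few records the estimates are extreme (some come out as exactly 0 or 1), which is why smoothing or expert-supplied priors are often worth considering in practice.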

Second, their reasoning can easily be understood by those who need to use these models, and their answers always come with probabilities attached. This is critically important in some application areas, such as medicine. Bayesian networks have been used as an aid to doctors in diagnosing illnesses. Studies have shown that physicians are reluctant to accept the conclusions of an AI system unless they understand how it came to those answers. It is reasonable to assume that this holds for other application areas as well. Since this reasoning process is (semi-)transparent in the case of Bayesian networks, they are applicable to a wider range of problems.

Bayesian networks rest on a rigorous probabilistic foundation, as opposed to the rule- and fuzzy logic-based systems of the early days of AI. Thus, the numbers that come out have a well-understood interpretation.

Many networks feature a semi-modular structure, which makes portions of them reusable. Consider a large Bayesian network that models the internal workings of a specific make and model of automobile. Part of the network would model the brake system, part would model the engine, and so on. These sub-models could potentially be reused in Bayesian networks for other automobiles if the components were similar enough. For example, the part of a network designed to model the brake system of a Toyota car might be reusable in an overall model for a Honda, even if the engines were significantly different. Since portions of Bayesian network models can be reused, modelers might feel it’s a worthwhile investment of resources to develop sophisticated models.

## Further Information

This article was meant to introduce you to some of the advantages of Bayesian networks. This is a field of active research and you can find some great resources for learning more. To name just a few, check out the following:

- Online course: The Probabilistic Graphical Models specialization at Coursera
- Book: The chapter on Bayesian networks in Artificial Intelligence: A Modern Approach by Russell and Norvig
- Book: Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman
- Book: Machine Learning: A Probabilistic Perspective by Kevin P. Murphy

I’ll also be posting another article soon about *dynamic Bayesian networks* and *dynamic decision networks*, two variants that model time-varying world states. Keep an eye out for them!