Convolutional neural networks and recurrent neural nets underlie many of the AI applications that drive business value. Learn about CNNs vs. RNNs in this primer.
To set realistic expectations of AI -- without missing opportunities -- it is important to understand algorithms, both their capabilities and limitations.
In this article, we explore two algorithms that have propelled the field of AI forward -- convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We will cover what they are, how they work, what their limitations are and where they complement each other.
But first, a brief summary of the main differences between a CNN vs. an RNN.
CNNs are commonly used in solving problems related to spatial data, such as images. RNNs are better suited to analyzing temporal, sequential data, such as text or videos.
A CNN has a different architecture from an RNN. CNNs are "feed-forward neural networks" that use filters and pooling layers, whereas RNNs feed results back into the network (more on this point below).
In CNNs, the size of the input and the resulting output are fixed. That is, a CNN receives images of a fixed size and assigns each to the appropriate label, along with the confidence level of its prediction. In RNNs, the size of the input and the resulting output may vary.
Use cases for CNNs include facial recognition, medical analysis and classification. Use cases for RNNs include text translation, natural language processing, sentiment analysis and speech analysis.
ANNs, CNNs, RNNs: What are neural networks?
The neural network was widely recognized at the time of its invention as a major breakthrough in the field. Taking a hint from how the neurons in our brains work, neural network architecture introduced an algorithm that allowed the computer to fine-tune its decision-making -- in other words, to learn.
An artificial neural network, or ANN, consists of many perceptrons. In its simplest form, a perceptron is a function that takes two inputs, multiplies each by a weight (initially random), adds the products together with a bias value, passes the result through an activation function and outputs the result. The weights and the bias value are adjustable, and they define the outcome of the perceptron, given two specific input values.
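To make that description concrete, here is a minimal Python sketch of a perceptron; the specific weights, bias and step activation below are illustrative assumptions, not values from any trained model.

```python
# A minimal perceptron sketch: two inputs, two weights, a bias and a step activation.
def perceptron(x1, x2, w1, w2, bias):
    weighted_sum = x1 * w1 + x2 * w2 + bias   # combine inputs with adjustable weights
    return 1 if weighted_sum > 0 else 0       # step activation: fire or stay silent

# Example with arbitrary (untrained) values -- changing w1, w2 and bias changes the outcome.
print(perceptron(x1=0.5, x2=0.8, w1=0.4, w2=-0.2, bias=0.1))  # -> 1
```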
This architecture was genius: combining the perceptrons generated layers of adjustable variables that could take on almost any task. The problem, though, was what numbers to pick for the weights and the bias values to make a correct calculation.
Bias in artificial neurons
In both artificial and biological networks, when neurons process the input they receive, they decide whether the output should be passed on to the next layer as input. The decision of whether to send information on is called bias and it's determined by an activation function built into the system. For example, an artificial neuron may only pass an output signal on to the next layer if its inputs (which are actually voltages) sum to a value above some particular threshold value. -- Linda Tucci
This was taken care of via a mechanism called backpropagation. The ANN is given an input, and the result is compared with the expected output. The difference between the desired output and the actual output is propagated backward through the neural network via a mathematical calculation that determines how the weights and bias of each perceptron should be adjusted to reach the desired result.
This procedure -- where the AI is trained -- is repeated until a satisfactory level of accuracy is reached.
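As a rough illustration of that training loop, the following Python sketch repeatedly adjusts the weights and bias of a single sigmoid unit by gradient descent; the toy input, target and learning rate are made up purely for demonstration.

```python
import numpy as np

# Toy training loop: nudge weights and bias until the output matches the expected result.
rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0           # start from random weights, as described above
x, target = np.array([0.5, 0.8]), 1.0    # one (made-up) training example
lr = 0.1                                 # learning rate

sigmoid = lambda z: 1 / (1 + np.exp(-z))
for step in range(100):
    output = sigmoid(x @ w + b)          # forward pass
    error = output - target              # how far off is the prediction?
    grad = error * output * (1 - output) # backpropagated gradient for this unit
    w -= lr * grad * x                   # adjust weights...
    b -= lr * grad                       # ...and bias toward the desired result
```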
A neural network like this works great for simple statistical predictions, such as predicting a person's favorite football team, given the person's age, gender and geographical location. But how can AI be used for more difficult tasks such as image recognition? The answer raises the question of how we feed the data into the network in the first place.
This chart outlines the chief differences between a convolutional neural network and a recurrent neural network.
Convolutional neural networks
What we see as images in a computer is actually a set of color values, distributed over a certain width and height. What we see as shapes and objects appears to the machine as an array of numbers. Convolutional neural networks make sense of this data through two mechanisms: filters and pooling layers.
"A filter is a matrix of randomized numbers. In a CNN, filters are multiplied against matrix representations of parts of the image, effectively scanning the picture pixel by pixel and getting the average value of all adjacent pixels, thereby detecting the most important features," explained Ajay Divakaran, the senior technical director of the Vision and Learning Laboratory in SRI International's Center for Vision Technologies, a nonprofit scientific research institute.
"This information is passed through a pooling layer, which condenses the acquired feature map into its most essential information," he added. This last step greatly reduces the size of the data and makes the neural network much faster. The resulting information is then fed into the neural network.
A CNN consists of several layers of perceptrons, and the filters effectively build a network that understands more and more of the image with every passing layer. While the first layer understands the outlines and borders, the second layer starts understanding shapes, and the third one understands objects. The power of this model is its capability to recognize objects, regardless of where in the picture they appear or their rotation.
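To illustrate the filter-and-pooling idea, the following NumPy sketch slides a small filter across an image array and then max-pools the resulting feature map; the 3x3 edge filter, the 8x8 "image" and the 2x2 pooling window are illustrative assumptions.

```python
import numpy as np

def convolve(image, kernel):
    """Slide a small filter (kernel) across the image and record its response at each position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Condense the feature map by keeping only the strongest response in each window."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)                # stand-in for a tiny grayscale image
edge_filter = np.array([[1, 0, -1]] * 3)    # a simple vertical-edge detector
feature_map = convolve(image, edge_filter)  # 6 x 6 response map
pooled = max_pool(feature_map)              # 3 x 3 -- much smaller and faster to process
```

Note that in a real CNN the filter values are not fixed by hand like this; they are learned during training.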
CNNs are great at recognizing objects, animals and people, but what if we want to understand what is happening in the pictures?
For instance, consider a picture of a ball in the air. How can we know if the ball is thrown and going up or if it is falling? Answering this question would require more information than a single picture -- we would need a video. The sequence of the pictures would determine if the ball is going up or down. But how can we make neural networks remember the information they had previously worked on and work that into their calculation?
Recurrent neural networks
The problem of remembering goes beyond videos -- in fact, many natural language understanding algorithms (that typically only deal with text) require some sort of remembering, such as the topic of the discussion or the previous words in the sentence.
Recurrent neural networks were designed to tackle exactly this problem. An RNN feeds its output at each step back into the network, so earlier results become part of the calculation that produces the final answer.
To illustrate, assume we want to translate the following sentence: "What date is it?" The algorithm feeds each word separately into the neural network, and by the time it arrives at the word "it," its output is already influenced by the word "What."
RNNs do have a problem, though. In the previous example, the words that are fed last into the network have a higher influence on the result (in our case, the words "is it?"). Those two words are not giving us much understanding of the full sentence -- the algorithm is suffering from "memory loss." This issue has not gone unnoticed, and newer algorithms such as Long Short-Term Memory (LSTM) solve that problem.
The diagram below, from Wikimedia Commons, depicts a one-unit recurrent neural network. From bottom to top: input state, hidden state, output state. U, V and W are the weights of the network. The compressed diagram is shown on the left and the unfolded version on the right.
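In code, that one-unit RNN can be sketched roughly as follows, with U, W and V playing the roles shown in the diagram; the tanh activation, the dimensions and the random weights are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim, output_dim = 4, 8, 3
U = rng.normal(size=(hidden_dim, input_dim))    # input -> hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden -> hidden (the recurrent feedback)
V = rng.normal(size=(output_dim, hidden_dim))   # hidden -> output weights

def rnn_forward(inputs):
    """Unrolled RNN: each step's hidden state carries a memory of the earlier steps."""
    h = np.zeros(hidden_dim)
    outputs = []
    for x in inputs:                      # e.g., one word vector per step
        h = np.tanh(U @ x + W @ h)        # new hidden state mixes the input with the previous state
        outputs.append(V @ h)             # output at this step
    return outputs

sentence = [rng.normal(size=input_dim) for _ in range(4)]  # toy vectors standing in for "What date is it"
outputs = rnn_forward(sentence)
```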
CNNs vs. RNNs: Strengths and weaknesses
Having seen how each network was designed, we can now point out the strengths and weaknesses of each.
"CNNs are preferred in interpreting visual data, sparse data or data that does not come in sequence," explained Prasanna Arikala, CTO at Kore.ai, a chatbot development company. "Recurrent neural networks, on the other hand, are designed to recognize sequential or temporal data. They do better predictions considering the order or sequence of the data as they relate to previous or the next data nodes."
Applications where CNNs are particularly useful include face detection, medical analysis, drug discovery and image analysis, Arikala said. RNNs are useful for language translation, entity extraction, conversational intelligence, sentiment analysis and speech analysis.
Because RNNs rely on the previous state to predict the future state, they "make sense for the stock market, as predicting where a stock would be headed depends a lot on where it has been earlier," he said.
However, as we learned earlier, when scanning a picture, a CNN's filter takes the adjacent pixels into account as it works. Could it not use the same mechanism for adjacent words?
"It is not that such an approach would not work at all," Divakaran explained. "[But] it's a needlessly roundabout approach." According to Divakaran, trying to use the spatial modeling capabilities of the CNN to capture what is basically a temporal phenomenon is suboptimal by definition and requires much more effort and memory to accomplish the same task.
CNNs vs. RNNs: Complementary models
But there are cases where the two models complement each other. Arikala shared an interesting case.
"For some of the Asian languages like Chinese, Japanese and Korean, where characters are like special images, we use deep neural networks built with a combination of CNN and RNN for intent detection and sentiment analysis," he said.
In these so-called logographic languages, some characters can translate to one or several English words, while others only mean something when they are suffixed to other characters, changing the meaning of the original character.
"The reason why a combination of neural networks works here is that we do character tokenization in logographic languages compared to [using] Treebank/WordNet tokenization in other languages," Arikala explained. "A combination of CNN and LSTM works much better than pure RNN."
Fred Navruzov, the data science lead at Competera, an AI company that helps retailers set optimal prices, agreed that the models can cooperate instead of compete with each other.
"Nowadays, the boundaries between CNN and RNN usage are somewhat blurred, as you can combine those architectures into CRNN for increased effectiveness in solving specific tasks like video tagging or gesture recognition," he said. In an analysis of a sequence of video frames, for example, RNN can be used to capture temporal information and the CNN can be used to extract spatial features from single frames.
Dig deeper into the expanding universe of neural networks
It's important to appreciate that CNNs and RNNs are just two of the most popular categories of neural network architectures. There are dozens of other approaches to organizing the way neurons connect together, and some that were obscure a few years ago are seeing significant growth today.
Examples of new neural networks include the following:
Transformers, which address many of the limitations of RNNs in processing large bodies of text, audio or video streams and are used to build large language models such as Google's BERT.
Generative Adversarial Networks, which combine multiple competing neural networks to make it possible to design drugs, generate digital fakes or improve media production.
Autoencoders, which are becoming the tool of choice for dimensionality reduction, image compression and data encoding.
In addition, AI services are finding ways to automatically create new, highly optimized neural networks on the fly using neural architecture search. These techniques create a starting architecture for a particular problem and iteratively analyze the results to fine-tune better architectures.
Current implementations of automated machine learning include Google's AutoML, IBM Watson's AutoAI, and the open source AutoKeras. Researchers are also exploring better ways of combining multiple neural network models of the same or different architectures using ensemble learning techniques.
Better techniques for comparing the performance and accuracy of neural network architectures could also play a role in making it easier for researchers to sift through many options for a particular AI task. Researchers are starting to find creative ways to apply traditional statistical techniques to compare the relative performance of different neural network architectures.