00:00 Geoff Burr: Good day. My name is Geoff Burr. I work at the IBM Almaden Research Center in San Jose, California, and the title of my talk is "Accelerating Deep Neural Networks with Analog Memory Devices."
I'd like to start by thanking the conference committee for giving me the opportunity to give this talk and for accommodating the constraints of virtual presentation. If 2020 has taught us anything, it's to take nothing for granted. It's a great honor to have this opportunity to present a keynote talk at the Flash Memory Summit. I can only hope that by the time of next year's summit, we'll all be able to meet and discuss in-person again. With that, let me start my talk.
00:38 GB: We are living through an information explosion. Every two days, the world generates the same amount of data as it had generated from the dawn of the information age up until 2003. Every two days. This is zettabytes of data, 10^21 bytes. An enormous amount of this data is unstructured, like raw videos, images and audio. We need tools to help us deal with this much data. It turns out one of the powerful tools we have actually originated more than 100 years ago with Nobel Prize-winning studies of the neurons and synapses in our own brains.
01:15 GB: This then motivated work in the 1950s on mathematical models of neurons and synapses called artificial neural networks. Over the next 50 years or so, this research proceeded slowly, alternating between periods of intense interest and periods of extreme disillusionment. This all changed in 2011 or so, when the three pillars of deep learning finally came together. We had actually had the algorithms for years. I myself learned both the backpropagation algorithm and stochastic gradient descent in graduate school in 1992. I'll explain both these terms in a few slides.
01:54 GB: What we were missing back then was massive amounts of data to train these networks with. We needed the internet to make this possible, and a very large amount of parallel compute, which we had by 2011 thanks to the popularity of video games. This combination of algorithm, data and compute quickly set new records in image recognition. And in the plot, you could see a huge drop in classification error in 2012, and only four years later, surpassing human capabilities. In the upper right corner, you can see that the word error rate of speech recognition systems began to drop towards human level. And this enabled things like automated call centers, Alexa, Siri, and Google Assistant.
02:35 GB: Most recently, natural language processing systems can be fed an initial sequence of word tokens, as shown here in blue, and then start to generate reasonable prose -- shown in black text -- by simply predicting the most likely next word, one after the other. All this was made possible by the scalability of deep learning. When you make a neural network model bigger and you train it with more data, it almost always gets better and, eventually, it will outperform anything else.
03:02 GB: Let me walk you through how these networks work. Although here I have drawn a simple fully connected network, any deep neural network is composed of layers of neurons interconnected by synaptic weights. Although you can pick any number and size of hidden layers, you design the input layer to match your data.
Here, I'm showing 2D images of handwritten digits from a 20-year-old data set known as MNIST. And you design your output layer to match your task, here classifying into one of 10 classes. When you have a trained network, you can perform forward inference. You apply your input data, and each layer of neurons drives the next through their weight connections. At the output, whichever neuron is driven with the strongest excitation represents the network's guess at your class. This will never be 100% correct, but if you did a good job during training, the accuracy can be pretty high. And, as we saw earlier, this can even be better than what a human would do.
03:57 GB: But how do we train this network and get a good set of weights? Training starts by doing the same forward propagation from input data. But, at first, because the weights are just random guesses, the answer is likely wrong. We need to have the correct answer or label for each piece of training data. You can think of the error, the difference between the network's guess and this label, like an energy function. The closer the guess and the label, the lower the energy. Then we backpropagate these errors from right to left, back through the network. And in each weight, we combine the error from the downstream neuron and the original excitation of the upstream neuron to produce the right weight update.
04:34 GB: If we look at weight 1,1 we can think of this as, how do we need to change this weight to ensure that the next time the network sees this image of a handwritten 1, all the wrong output neurons, including number 8, are now weaker and the right neuron, number 1, is stronger? Or you can think of this as descending along the gradient of that energy function that I mentioned. Since there's no guarantee that the next image won't need some completely different weight changes, we make only a tiny change to the weight. And we show the network all the training images we have over and over again, but randomizing the presentation order. This is why we call this stochastic gradient descent.
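The forward propagation, backpropagation and stochastic-gradient-descent update described above can be sketched in a few lines of NumPy. The layer sizes, learning rate and sigmoid nonlinearity are illustrative choices, not specifics from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully connected net: 784 inputs (a 28x28 digit) -> 50 hidden -> 10 classes.
W1 = rng.normal(0.0, 0.1, (784, 50))   # weights start as random guesses
W2 = rng.normal(0.0, 0.1, (50, 10))
lr = 0.01                              # only a tiny change per example

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(W1, W2, x, label):
    """One stochastic-gradient-descent step on a single training example."""
    # Forward propagation: each layer of neurons drives the next via its weights.
    h = sigmoid(x @ W1)
    y = sigmoid(h @ W2)
    target = np.zeros(10)
    target[label] = 1.0
    # The "energy": how far the network's guess is from the label.
    energy = 0.5 * np.sum((y - target) ** 2)
    # Backpropagate the error from the output layer back toward the input.
    d_out = (y - target) * y * (1.0 - y)
    d_hid = (d_out @ W2.T) * h * (1.0 - h)
    # Each weight update combines downstream error with upstream excitation.
    W2 = W2 - lr * np.outer(h, d_out)
    W1 = W1 - lr * np.outer(x, d_hid)
    return W1, W2, energy

# "Stochastic": present all training images over and over in randomized order.
x = rng.random(784)                    # stand-in for one handwritten-digit image
W1, W2, e0 = sgd_step(W1, W2, x, label=1)
W1, W2, e1 = sgd_step(W1, W2, x, label=1)
assert e1 < e0                         # descending the energy gradient
```

With real MNIST images and enough epochs, this same loop is what drives the accuracy numbers discussed above.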
05:11 GB: Computer scientists throughout the world, including many of my colleagues at IBM Research, are driving the evolution of AI from narrow AI that can learn from a lot of data to do one task really well, towards broad AI that can handle multiple tasks, domains and modes that can be explainable, trusted and distributed, and that can still be trained to high accuracy even when we don't have a lot of training data to work with. IBM is helping many enterprise clients with computer-vision applications in fields such as construction, insurance, sporting events and technical support. These kinds of applications use convolutional neural networks trained on databases of color images.
05:49 GB: Here, I show CIFAR, which is a small data set with either 10 or 100 labels. ImageNet is a larger data set, with more pixels per image, more images in the data set and up to 1,000 classes. There can be a lot of pixels in these images, so what a CNN does is learn a cascaded bank of filters that can be convolved across the image. From layer to layer, we use convolution and pooling to reduce the number of pixels but increase the number of filters. Simple filters are learned in the early layers, which then work together with later filters to look for quite complex objects, like the nose of a particular breed of dog. None of this is programmed. It's all learned as we train on the images in the data set.
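The convolve-and-pool pattern can be sketched directly. Here a hand-picked edge filter stands in for the learned ones; in a real CNN the filter values themselves are learned during training:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide ("convolve") one filter across a 2D image -- no padding."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

def max_pool2(x):
    """2x2 max pooling: fewer pixels per channel as we go deeper."""
    H, W = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))

# A 32x32 single-channel image (CIFAR images are 32x32, in color) and one
# 3x3 edge-detector-like filter, chosen by hand for illustration.
img = np.random.default_rng(1).random((32, 32))
edge = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
fmap = max_pool2(conv2d_valid(img, edge))
print(fmap.shape)   # convolution then pooling shrinks the 32x32 image
```

Stacking many such filter banks, layer after layer, is what lets later layers respond to complex objects built from the simple early-layer features.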
06:31 GB: Beyond these computer-vision applications enabled by CNNs, we at IBM have many clients with applications that call for some sort of natural language processing. This includes language translation, speech transcription, and language processing tasks like automated analysis of social media and chatbots. With these kinds of applications, we need a network that can deal with sequences in time, like sentences and paragraphs.
One example is known as a recurrent neural network. Each time we put a token or word from the sentence into the network, we also inject the previous state -- shown here with the brown arrow -- back into the network. I show this here as I unroll the operation in the network in time from left to right. By the time we get to the end of the sentence, the network has some memory of each token in the sentence, enough that it is capable of connecting this sentence to the concept of a new account even though the customer did not explicitly say those two words together.
07:27 GB: However, you can graphically see the problem here. The RNN tends to be much better at remembering the recent tokens than the earlier ones. By the time we get to the end of the sentence, very little of the first token, "How," is still part of the network's current state. The RNN variant that has met with the most success is called LSTM, long short-term memory, because it uses a complicated series of gates to try and learn what to remember from each token, and what it can afford to forget, as it goes through a sequence.
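The fading memory of a plain RNN can be demonstrated numerically. In this toy sketch, the state size, weight scales (chosen so the recurrent map is contractive) and the perturbation-based sensitivity measure are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hidden-state size (illustrative)
W_h = rng.normal(0, 0.1, (d, d))         # recurrent ("brown arrow") weights
W_x = rng.normal(0, 0.3, (d, d))         # input-token weights

def rnn_step(state, token_vec):
    # New state mixes the current token with the previous state.
    return np.tanh(token_vec @ W_x + state @ W_h)

tokens = ["How", "do", "I", "open", "an", "account", "?"]
vecs = {t: rng.normal(0, 1, d) for t in tokens}   # toy token embeddings

state = np.zeros(d)
states = []
for t in tokens:
    state = rnn_step(state, vecs[t])
    states.append(state.copy())

# How much of each token survives in the final state? Perturb one token's
# embedding slightly and measure how much the final state moves.
eps = 1e-4
def sensitivity(perturb_idx):
    s = np.zeros(d)
    for i, t in enumerate(tokens):
        v = vecs[t].copy()
        if i == perturb_idx:
            v += eps
        s = rnn_step(s, v)
    return np.linalg.norm(s - states[-1]) / eps

print(sensitivity(0), sensitivity(len(tokens) - 1))
# The first token's influence on the final state is much smaller than the
# last token's -- the fading-memory problem that LSTM gates address.
```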
07:56 GB: Another popular class of networks for NLP applications is transformer-based networks. Rather than recurrence, these networks go through the tokens in a sequence and try to build up an attention matrix, which the network uses to decide which tokens within the sentence are most relevant to each other, to help it predict what the next most likely token in the sentence might be. That example from the GPT-3 network that I showed you earlier uses many, many transformer layers, with about 175 billion learnable parameters.
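The attention matrix at the heart of a transformer layer can be sketched as scaled dot-product attention over a toy sequence; the sequence length, embedding size and random projections here are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.
    A[i, j] says how relevant token j is to token i."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Softmax each row so every token's attention over the sequence sums to 1.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)     # the attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
seq_len, d = 7, 16                    # 7 tokens, 16-dim embeddings (toy sizes)
X = rng.normal(0, 1, (seq_len, d))
# In a real transformer, Q, K and V come from learned projections of X.
Wq, Wk, Wv = (rng.normal(0, 0.1, (d, d)) for _ in range(3))
out, A = attention(X @ Wq, X @ Wk, X @ Wv)
assert A.shape == (seq_len, seq_len)          # one relevance row per token
assert np.allclose(A.sum(axis=1), 1.0)        # each row is a distribution
```

Stacking dozens of such layers, each with its own learned projections, is how models like GPT-3 arrive at billions of parameters.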
08:26 GB: Now, in the convolutional networks for image processing, there are a lot of neurons, but each neuron is only connected to a small subset of neurons in the next layer. And the weights in those small filters are extensively reused as we convolve across the entire image. As we will see, this actually makes CNNs fit pretty well onto digital accelerators, especially if memory and processing are organized appropriately. In contrast, both the LSTM and transformer networks for natural language processing tend to use a lot more fully connected layers with large weight matrices and less weight reuse, making these networks much less pleasant to deal with on digital accelerators.
09:02 GB: Now, as we move towards broad AI, we need to be a little nervous because even the narrow AI networks require significant compute. When we train a big CNN -- whether we do it slowly with a handful of GPUs or rapidly with a whole bunch of GPUs -- we're still consuming an enormous amount of energy, equivalent to two weeks of home energy consumption. Worse yet, as we have tried to ride along this trend that bigger models perform better, we've been rapidly growing the amount of compute we need. This plot shows that the compute for a single training of the largest network, in units of petaflop/s-days, has been doubling every 3.5 months. Both training and inference energy grow as the models get bigger. As shown here, training energy also grows as we train with more data and train for longer times. And total inference energy across the world rises as these large models are used on more and more platforms and in more and more applications.
09:56 GB: In any case, these kinds of trends call for some serious innovation in terms of energy efficiency. If we look at the AI hardware roadmap for three opportunity areas -- forward inference in the cloud, inference at the edge, and training, which is going to mostly be in the cloud -- we started this field with general-purpose CPUs and GPUs. And we expect these to remain a critical workhorse for many years. There are a lot of people trying to implement custom digital accelerators to help scale compute performance and energy efficiency. While I list Google's TPU2 and TPU3 here, one could make the case that the more recent GPU releases from companies like Nvidia are effectively also custom accelerators for deep learning.
In the next section of this talk, I'll briefly summarize the digital accelerator work we're doing within IBM Research before I turn to my main topic of analog memory-based accelerators for deep learning.
10:47 GB: In the conventional von Neumann architecture, computation in the central processing unit requires data to come across the bus from memory. Say we want to multiply two numbers, A and B. Well, the first thing we do is send the instruction bits associated with the operation, and then we send the operands. Once the result is available within the CPU, we need to send it back across the bus to be stored into memory. So, this bus causes us to expend a lot of energy, especially if a lot of data has to move a long distance.
Fortunately, one of the interesting things about deep learning is that we can often reduce the numerical precision of the computations, yet still see the same neural network accuracy. So, we're going to reduce the amount of data going through the bus, not by reducing the number of neurons or the number of synapses, but by reducing the number of bits used to encode each neuron excitation or synaptic weight.
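The reduced-precision idea can be sketched as symmetric int8 quantization: each weight is stored in 8 bits plus one shared floating-point scale, so 4x fewer bits cross the bus. The tensor size and weight distribution here are illustrative:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: 8 bits per value
    plus one shared float scale, instead of 32 bits per value."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, 1024).astype(np.float32)   # a toy weight vector
q, s = quantize_int8(w)
w_hat = q.astype(np.float32) * s                   # dequantize at the far end

# The reconstruction error stays small relative to the weights themselves,
# which is why network accuracy often survives the precision reduction.
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
assert rel_err < 1.0 / 127.0 + 1e-6
```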
11:37 GB: We at IBM Research have been using this approximate computing across the compute stack to improve energy efficiency. We have introduced a number of compression and quantization techniques, both for encoding excitations and weights in fewer bits, and for sharing weight-gradient information across different parts of a system that might be working together to train a very large DNN model. We've found ways to change the programming model to reduce the communication load, as I mention in the center. And, as shown on the right, we've developed customized dataflow architectures, which I explain further on this slide.
As you can see in the drawing, this architecture carefully organizes computation in a grid of low-precision processing elements, together with a few higher-precision special function units, and data movement to and from nearby scratchpad storage.
12:22 GB: As a result, our designs are able to support high utilization. They're always busy across a wide variety of workloads, even when the batch size is very, very small. This greatly helps improve energy efficiency, because these compute elements are always busy doing real work, spending very little time sitting idle with their hands in their pockets, burning energy, just waiting for data to arrive. Although we can reduce the numerical precision of the computations, we do want to get the same neural network accuracy, even as we reduce the number of bits per weight or activation used for training or inference. Industry practice, as of 2019 or so, seems to have settled on some sort of 16-bit floating point for training and 8-bit integers for inference.
13:03 GB: Our team at IBM has been at the forefront of this effort, having published the very first paper showing this was even possible back in 2015. Since that time, we have shown you can train networks to ISO accuracy using a combination of 8-bit floating point formats with careful use of mantissa and exponent bits, together with higher precision at critical points throughout the net. In inference, we have dropped the required bit precision well below 8-bit integers, showing full ISO accuracy across a wide suite of networks at 4-bit precision, and promising results, with near-ISO accuracy at 2-bit precision, for a subset of popular networks.
13:38 GB: Other trends across the industry for digital accelerators include the exploitation of sparsity, where the hardware can recognize zero weights and zero activations and skip them; the development of networks optimized in software to maximize energy efficiency, speed and accuracy with a particular hardware architecture in mind, that is, hardware-software codesign; and the incorporation of more and more on-chip SRAM to reduce the large energy costs of accessing off-chip DRAM. We've talked about being more clever in digital accelerators and how we bring data to the CPU.
Let's switch this around and try to bring the computation to the data. Analog accelerators are a form of in-memory computing. When we want to multiply two numbers, we keep one of them in the memory, we bring the other in, and we do the operations at the location of the data.
14:26 GB: If we are moving only a 1D vector, B, but doing computation across an entire 2D matrix, A, this can offer both high energy efficiency and high speed. We should talk about how we map the neural network onto the memory arrays and how the multiplication gets performed.
Going back to our simple fully connected neural net, we map each layer of the neural network onto one or more analog memory tiles, with the upstream and downstream neurons mapped to west-side or south-side peripheral circuitry, and the synaptic weights mapped into the resistive memory elements. Since this is a weight-stationary architecture, each layer of the network needs its own arrays. But this allows us to execute the network in an extremely efficient way, minimizing the data transport to just the excitation vectors and allowing an efficient pipelined execution of one data example after the next.
15:20 GB: Let's look at how our memory array implements the multiply-accumulate operation that we need to perform. We encode weights into conductance values on the array, using selectors, diodes or transistors to access each device we need, and using the difference between two conductances, a conductance pair, to encode a signed weight. When we use the peripheral neurons to encode each excitation into a voltage signal, V, the act of reading the device conductance, G, produces a current proportional to G times V, which means we have gotten Ohm's Law to perform the multiplication of the excitation and the weight for us.
15:58 GB: This is the same exact operation that we would have performed when reading resistive memory except that here we intentionally do a read on all the rows at the same time, each encoded with its neuron's excitation, each device doing its multiply in parallel, and adding its current to the vertical bit line. This means we have gotten Kirchhoff's Current Law to perform the accumulate for us, and we have performed an entire vector matrix multiply, in parallel, in constant time, at the location of the weight data.
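The Ohm's-Law-multiply plus Kirchhoff's-Law-accumulate picture can be checked numerically with a toy conductance-pair mapping. The conductance range and read voltage are illustrative values, not the project's actual design points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Map a 4x3 weight matrix onto conductance pairs: W ~ (G_plus - G_minus).
W = rng.normal(0, 0.5, (4, 3))
g_max = 25e-6                        # ~25 microsiemens full scale (illustrative)
scale = g_max / np.abs(W).max()
G_plus = np.where(W > 0, W, 0.0) * scale
G_minus = np.where(W < 0, -W, 0.0) * scale

x = rng.random(4)                    # upstream neuron excitations
V = 0.2 * x                          # encode excitations as read voltages (0.2 V max)

# Ohm's law does each multiply (I = G * V); Kirchhoff's current law does the
# accumulate: the currents from every row sum on each shared bit line.
I_out = V @ G_plus - V @ G_minus     # net current per column, in amperes

# The analog bit-line currents are exactly proportional to the digital
# matrix-vector product -- the whole multiply-accumulate in one read.
assert np.allclose(I_out, 0.2 * scale * (x @ W))
```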
16:25 GB: Now, before going on, I did want to mention that many researchers are working on in-memory compute variants where they restrict the number of states needed from the memory element, often all the way down to binary. While this allows them to use conventional memories like SRAM, or to use emerging non-volatile memories like RRAM in a binary memory mode, it also means that they need multiple memory elements across multiple bit lines and word lines in order to implement each multiply-accumulate between weights and activations, for even a few bits of precision.
16:55 GB: But, in the rest of my talk, I will focus on the full analog AI opportunity that excites us the most, with weights encoded into continuous analog conductance states. We're not just waving our hands when we say this can provide high performance and high energy efficiency. We have put full 90-nanometer designs through performance projections, and we are starting to verify these in silicon.
In the top plot, you can see that the energy required decreases, and thus the energy efficiency increases, as we move from left to right, decreasing the conductance and thus the read current. And in the bottom plot, as we move from right to left, we can see that we start to get interesting performance per unit area as we push the integration time per analog tile below, and then well below, 100 nanoseconds.
17:39 GB: In general, high performance and energy efficiency in analog AI is maximized for DNN layers with low weight reuse, such as the fully connected layers that I described in LSTM and transformer networks. It's maximized when we can be really efficient in how we send DNN excitations from the edge of one tile to the edge of the next, and maximized when we have good analog memory devices.
So, what would make a good resistive analog memory element? Well, if we focus on forward inference, we need only modest endurance and programming speed. We want a device that we can keep scaling down in size, and we want to make sure we don't have too much read current, so we need the right resistance range. We want to be able to program conductance states with modest amounts of power, which then stay stable in time, even at elevated temperature.
18:25 GB: The device we have chosen to focus on for inference is phase-change memory, or PCM. PCM depends on transitions between a highly resistive amorphous phase and a low-resistance crystalline phase, induced by Joule heating, which we can control by the amplitude and shape of applied voltage pulses. PCM satisfies many of the necessary criteria. As a device in the back end of the line wiring, we can easily make small devices above the silicon, and when we do so, the programming power can be small. That's not the hard part. The engineering trick is fabricating all the PCM devices at exactly the same size as we make them small. But even as we are now starting to get good at that, there are some real challenges for retention, for programming error, and for conductance stability, all related in some way to polycrystalline grains.
19:08 GB: If we look at the prototypical mushroom cell, we see that, due to the high current density as write current tries to get from the large top electrode to the narrow bottom electrode -- also known as a heater -- we get this hemispherical dome of hot material; thus the name, the mushroom cell, because it looks like the cap of a mushroom. If this hot material goes over the melting point, around 600 degrees Celsius, and then quenches to room temperature fast enough, we block all the paths for read current with a resistive amorphous plug, leading to a G value close to zero.
19:40 GB: If subsequent pulses ramp down very slowly, we get the state on the far right, where read current can easily get to the heater through highly crystalline material, and the G value can be fairly large, perhaps tens of microsiemens. But to have complete analog tunability, we need those center states, and we need to program the devices into intermediate states where read current has to make its way through regions that have numerous small polycrystalline grains. And these grains seem to have an impact on retention, programming and conductance stability.
20:09 GB: The same physics that enables the fast crystal growth is still slowly working away at lower temperatures, eventually leading to retention issues. We can push this to higher temperatures by using PCM materials that exhibit a higher crystallization temperature. In terms of programming, the physics of grain nucleation means that each pulse generates some new configuration of grains. So, we have some stochasticity here, fundamentally. But the write-speed requirements for inference are much more favorable than for a memory application, so we should have enough time to do iterative closed-loop tuning of the conductances. And, at least in first implementations, we can trade off some areal density to get more redundancy and SNR, signal-to-noise ratio, using what we call multiple conductances of varying significance. However accurately we program a conductance state, we need to maintain that state over time. Unfortunately, continued relaxation of the amorphous phase, both in the main plug and at all the grain boundaries, contributes to conductance drift.
21:08 GB: Fortunately, this is an effect that slows down in time, shown here as straight lines on a log plot of conductance versus time, with a slope ν (nu) that is usually 0.1 or less. How can we deal with this drift? Well, we can use a time-dependent gain factor, which can help us subtract off the average drift, at least for a little while, although eventually we'll have to worry about amplification of background noise.
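The drift law described here is G(t) = G0 * (t/t0)^(-ν), and the gain factor multiplies the readout by (t/t0)^(+ν) to undo the average drift. In this sketch, the average exponent, the device-to-device spread and the conductance range are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t0 = 1.0                 # seconds after programming, when G0 was measured
nu = 0.05                # average drift exponent (typically 0.1 or less)

G0 = rng.uniform(1e-6, 25e-6, 100)        # programmed conductances (siemens)
t = 3600.0                                # read the array one hour later
# Each device drifts with a slightly different exponent (the cycle-to-cycle
# stochasticity); the 0.01 spread here is an illustrative assumption.
nu_dev = nu + rng.normal(0, 0.01, G0.size)
G_t = G0 * (t / t0) ** (-nu_dev)          # conductances after drift

# Time-dependent gain factor: multiply by (t/t0)^+nu to undo the AVERAGE drift.
G_corr = G_t * (t / t0) ** nu

err_before = np.mean(np.abs(G_t - G0) / G0)
err_after = np.mean(np.abs(G_corr - G0) / G0)
assert err_after < err_before
# The gain factor recovers the mean, but the device-to-device spread in nu
# remains -- which is why drift variability needs its own countermeasures.
```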
21:30 GB: Unfortunately, drift also exhibits a cycle-to-cycle stochasticity ascribed to those same polycrystalline grains. Hardware-aware retraining can help us a lot here: exposing the network to noise during training so it will be more robust to noise on the weights, including from drift variability, during inference. This is very similar to what our digital accelerator colleagues already do when they carefully train a quantized deep neural network to have high accuracy despite reduced digital precision during inference.
Finally, we have been working for quite some time on new PCM device designs. In a projected PCM device, the amorphous plug doesn't block read current so much as deflect it into an intentionally included shunt layer.
22:08 GB: This sacrifices some of the huge conductance contrast of other devices. But, in exchange, we get significantly lower drift, lower drift variability, and lower random telegraph noise. As shown in the plot on the lower right, we have been able to use these kinds of devices to achieve matrix operations at an effective precision equivalent to 8-bit fixed point. We even have to zoom into the correlation plot to be able to see the noise. So, we're very excited and optimistic about the combination of all these techniques.
22:36 GB: In my last few minutes, let me switch over and talk about analog AI for training. Here, we are looking for a slightly different kind of device worrying more about endurance than retention. We still want fast low-power programming, and we still want device conductance to remain stable and unchanged between weight updates. But now we want to be able to make tiny, gradual and, most of all, symmetric conductance changes to the resistive memory devices encoding our weights.
23:00 GB: The network is going to apply many, many weight updates with the expectation that the weight increase and the weight decrease requests will all cancel correctly. Our device needs to exhibit symmetric conductance response to deliver on this expectation. If this doesn't happen, training accuracy almost always suffers.
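A sketch of why this symmetry matters: SGD expects equal numbers of up- and down-requests to cancel, but a device whose up-steps are stronger than its down-steps accumulates a bias. The step size and asymmetry factor here are illustrative:

```python
def pulse(g, direction, asym=0.0):
    """One programming pulse: raise or lower conductance by a small step.
    asym > 0 makes 'up' steps bigger than 'down' steps (the nonideality)."""
    step = 0.0002 * ((1 + asym) if direction > 0 else (1 - asym))
    return min(max(g + direction * step, 0.0), 1.0)

# Training expects 1000 up-requests followed by 1000 down-requests to cancel.
residual = {}
for asym, label in [(0.0, "symmetric"), (0.3, "asymmetric")]:
    g = 0.5                      # normalized conductance encoding one weight
    for _ in range(1000):
        g = pulse(g, +1, asym)
    for _ in range(1000):
        g = pulse(g, -1, asym)
    residual[label] = abs(g - 0.5)

print(residual)
# The symmetric device returns to its starting weight; the asymmetric one
# is left with a bias -- the kind of error that degrades training accuracy.
```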
Here, we at IBM Research have been focusing on resistive RAM, in which an insulating layer, like say hafnium dioxide, is bridged by creating and then modulating conductive filaments composed of vacancies and defects. RRAM can meet many of the requirements for analog AI training. Like PCM, RRAM can be made in the back-end wiring above the silicon, although endurance, yield and variability are not yet at the levels demonstrated in PCM. Programming power can be kept low by enforcing compliance externally, but because this reduces the number of atoms being moved around at the weak point of the filament, we also have to contend with more and more shot-noise-like statistical variations as we decrease programming power.
23:57 GB: Despite all that, we have shown that you can engineer these RRAM devices so that you can move through a continuum of analog conductance states. As you alternate between conductance-increase pulses, shown in red, and conductance-decrease pulses, shown in blue, even by eye you can see that the symmetry is not quite perfect. While we continue to improve RRAM, we're exploring other possibilities as well, including three-terminal ECRAM devices. First introduced by Sandia Labs, these devices act like tiny batteries: the voltage applied to the gate can be used to add or subtract ions into or out of the conductive channel between the two other electrodes. While there are many challenges to making these fast and small and parallelizing them in large arrays, these devices definitely exhibit the linear, symmetric and gradual conductance update that we ask of them.
Another approach we've tried are systems solutions that separate out the role of permanent weight storage from that critical role of gradient accumulation, where we need the symmetric weight update. In a paper that my group wrote in Nature, we did this by combining volatile and non-volatile analog memories together in a complex unit cell. In a paper by my IBM Zurich colleagues, they accumulated the gradient in digital hardware and then transferred it onto PCM devices. And, finally, some of my colleagues have been looking into the underlying training algorithm itself.
25:06 GB: So, now I come to the conclusion. We can predict that deep learning is going to need custom hardware accelerators, and there is a real opportunity to make digital accelerators more energy efficient by aggressively reducing numerical precision.
25:23 GB: For convolutional networks, we can go even further by exploiting weight reuse and sparsity and by doing extensive hardware-software co-optimization. Meanwhile, there's this parallel opportunity to use in-memory computing. I showed you how Kirchhoff's Current Law can enable efficient analog summation, and it seems like some variant of this will almost certainly show up in future, otherwise-digital accelerators. In our analog AI project, we're trying to also exploit the benefits of multiplication via Ohm's Law, which we feel is really attractive for energy efficiency and speedup on the fully connected layers that we find throughout the networks being used for natural language processing. I described our recent progress in implementing inference using phase-change memory devices and laid out many of the things we're doing to address stochasticity, drift and noise.
26:06 GB: And I described our recent progress towards analog AI for training, including device work on RRAM and ECRAM to implement the symmetric and gentle conductance update that SGD training would require, and new training algorithms that can harness asymmetry to provide software-equivalent training even with highly imperfect devices.
26:24 GB: With that, I would like to thank both the colleagues across our worldwide team who contributed some of the slides in my deck, including Dr. Sydney Goldberg, Dr. Kailash Gopalakrishnan and Dr. Praneet Adusumilli. And I would also like to acknowledge the help and advice of the many colleagues in our analog AI team at the Almaden Research Center in California, including our close collaborators from IBM Tokyo and IBM Yorktown Heights, and I'd also like to thank the management support over many years, both from local management in our Almaden site and from our global analog AI team management.
I will end with this slide, so I can point out that all this work is performed as part of the IBM Research AI Hardware Center, which is a cross-industry partnership between IBM Research, IBM clients, academia and the State of New York. In the lower right corner, I put my contact info and a link to our Almaden analog AI group in purple and in bright blue are links to learn more about the IBM Research AI Hardware center. Thank you for listening and stay safe.
27:19 GB: The organizers have asked me to prepare two audience questions and then go ahead and answer these. While this feels a little bit like cheating, I have done as they requested. My first question to myself is, "Can you tell us more about this new training algorithm?"
My IBM Yorktown colleague, Tayfun Gokmen, has come up with an alternative to stochastic gradient descent that can use the asymmetry in our RRAM devices for computation. He calls this the Tiki-Taka algorithm. As you can see here along the horizontal black line, if you use the backpropagated errors and do SGD-style weight updates directly on weight matrices encoded into highly asymmetric RRAM devices, the network refuses to converge, and training error remains much higher than we'd expect for this network on digital full-precision hardware, no matter how long we train.
28:07 GB: But if we do an exchange of these weight updates into a second set of matrices, and pass these updates back and forth like we were Spanish soccer players, you can see from the blue line that we obtain the same low training error, and thus high accuracy, that we got from full-precision digital systems, despite the fact that the analog memory devices exhibit highly asymmetric conductance response. So, again, we're very excited and optimistic about the combination of some or all of these techniques for analog AI training.
28:36 GB: The second question from myself is a bit less of a softball: "What are you going to do when you run out of devices to encode the model? And isn't the scheme of using multiple conductance per weight going to make this problem even worse?" Good question.
Well, first, we can always connect more layers into the network by using multiple chips. All that means is that one of these arrows leaving a tile on one chip needs to go to a second chip. If we pick the split cleverly, energy issues won't be too bad. For the second part, let me go back and go into the benefits that we get from multiple conductances of varying significance.
As this slide shows, even in a situation where some of the devices seem to have a ceiling on the conductance to which they can be programmed, by organizing each weight as two conductance pairs, we get some built-in redundancy and a much better ability to get analog weights tuned to exactly the target value we want.
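One way to picture this two-pair scheme: a most-significant pair carrying a gain factor F, plus a least-significant pair that tunes the residual. F, the conductance ceiling and the effective programming step here are illustrative assumptions, not the published hardware values:

```python
import numpy as np

# Each weight uses two conductance pairs of different significance:
#   W  ~  F * (G_plus - G_minus)  +  (g_plus - g_minus)
F, g_max, g_step = 4.0, 25e-6, 1e-6      # illustrative device parameters

def program_weight(target):
    """Closed-loop-style tuning: coarse pair first, then fine pair."""
    msp = np.clip(np.round(target / (F * g_step)) * g_step, -g_max, g_max)
    residual = target - F * msp
    lsp = np.clip(np.round(residual / g_step) * g_step, -g_max, g_max)
    return msp, lsp

target = 37.3e-6                          # desired net conductance for one weight
msp, lsp = program_weight(target)
achieved = F * msp + lsp
# The coarse pair alone can miss by up to F * g_step / 2, but the fine pair
# pulls the achieved weight to within half a small step of the target.
assert abs(achieved - target) <= g_step / 2 + 1e-12
```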
29:25 GB: And again, as we get better at engineering and programming PCM devices, we can push to higher areal density. We can go back to one conductance pair, just a G+ and a G-, or we can reach even higher density by giving each weight its own G+ device but having them share a G- device. We can also expect significant density improvements as we reduce programming and read currents. These two together will allow us to shrink the access transistors and to use narrower wiring as we go across the arrays. Thank you again.