There's no denying the sweeping impact of artificial intelligence on enterprises. From automating business processes to acquiring new customers, the technology is everywhere. The rapid development of AI technology is supported in part by deep learning, a subset of AI that trains algorithms to learn in humanlike steps.
Stacking deep learning algorithms yields automated, intelligent systems that can be applied to a wide range of enterprise problems. The science behind deep learning is a vast map of neural networks: series of algorithms that can sort data and train themselves to process it. Data scientists and companies that want to build process-focused, personalized deep learning technologies for their enterprise must start by creating a simple neural network.
The problem? There's nothing simple about neural networks. The process of human learning is difficult to understand and explain, and neural networks often suffer from the same difficulty. Data scientists who want to build a neural network from scratch need to understand the basic science behind training alongside the code required to build their network. Then, building the model from the ground up requires a lengthy training process rooted in data preparation, mathematical visualization and natural language processing.
Breaking down the complex math concepts behind deep learning is a key to understanding the "why" behind the foundational techniques for training neural networks. Grokking Deep Learning, written by Andrew Trask and published by Manning Publications Co., uses graphics and colloquial language to break down the complex science behind building a neural network. For data scientists at any level, this overview of deep learning covers parameter tuning, overfitting, data cleansing, coding techniques and eventual applications of the technology.
Chapter 6 of Grokking Deep Learning focuses on how to build and train your first neural network with custom parameters, complete with graphics and easy-to-follow instructions.
Full, batch, and stochastic gradient descent
Stochastic gradient descent updates weights one example at a time.
As it turns out, this idea of learning one example at a time is a variant of gradient descent called stochastic gradient descent, and it’s one of a handful of methods that can be used to learn an entire dataset.
How does stochastic gradient descent work? As you saw in the previous example, it performs a prediction and weight update for each training example separately. In other words, it takes the first streetlight, tries to predict it, calculates the weight_delta, and updates the weights. Then it moves on to the second streetlight, and so on. It iterates through the entire dataset many times until it can find a weight configuration that works well for all the training examples.
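The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration of stochastic gradient descent on the streetlights dataset from the chapter; the starting weights, learning rate and iteration count here are illustrative choices, not prescribed values:

```python
import numpy as np

# Streetlights dataset: each row is one light's three lamps (on = 1,
# off = 0); the target says whether people walked (1) or stopped (0).
streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])

weights = np.array([0.5, 0.48, -0.7])  # illustrative starting weights
alpha = 0.1                            # illustrative learning rate

# Stochastic gradient descent: predict, compute weight_delta and
# update the weights for EACH example, one at a time.
for iteration in range(40):
    for i in range(len(walk_vs_stop)):
        input_ = streetlights[i]
        goal = walk_vs_stop[i]
        prediction = input_.dot(weights)
        delta = prediction - goal
        weight_delta = delta * input_             # gradient for this one example
        weights = weights - alpha * weight_delta  # update immediately
```

After enough passes over the four examples, the weights settle on a configuration that predicts every streetlight correctly.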
(Full) gradient descent updates weights one dataset at a time.
As introduced in chapter 4, another method for learning an entire dataset is gradient descent (also called average or full gradient descent). Instead of updating the weights once for each training example, the network calculates the average weight_delta over the entire dataset and changes the weights only after computing that full average.
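For contrast, here is a minimal sketch of full gradient descent on the same toy streetlights data. The key difference from the per-example version is the single weight update per pass, applied only after weight_delta has been averaged over all four examples (the learning rate and iteration count are again illustrative):

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

# Full gradient descent: accumulate weight_delta over the WHOLE
# dataset, average it, then apply one update per full pass.
for iteration in range(300):
    avg_weight_delta = np.zeros(3)
    for i in range(len(walk_vs_stop)):
        delta = streetlights[i].dot(weights) - walk_vs_stop[i]
        avg_weight_delta += delta * streetlights[i]
    avg_weight_delta /= len(walk_vs_stop)
    weights = weights - alpha * avg_weight_delta  # one update per pass
```

Because each update uses the average gradient rather than a single example's gradient, each step is smoother but smaller, so this version typically needs more passes over the data than the stochastic version to reach the same accuracy.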
Batch gradient descent updates weights after n examples.
This will be covered in more detail later, but there’s also a third configuration that sort of splits the difference between stochastic gradient descent and full gradient descent. Instead of updating the weights after just one example or after the entire dataset of examples, you choose a batch size (typically between 8 and 256) of examples, after which the weights are updated.
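Batching can be sketched the same way. Since a batch size of 8 needs more than four examples, this sketch uses a small synthetic linear-regression dataset rather than the streetlights data; the dataset, weights and hyperparameters are all illustrative assumptions, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (an assumption for illustration): 256 examples
# whose targets are an exact linear function of three inputs.
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

weights = np.zeros(3)
alpha = 0.05
batch_size = 8  # splits the difference between 1 example and the full dataset

# Batch (mini-batch) gradient descent: average the gradient over
# batch_size examples, then update the weights once per batch.
for epoch in range(50):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        delta = xb @ weights - yb
        weight_delta = xb.T @ delta / len(xb)  # average gradient over the batch
        weights -= alpha * weight_delta        # one update per batch
```

With batch_size = 1 this reduces to stochastic gradient descent, and with batch_size = len(X) it reduces to full gradient descent, which is why batching is often described as splitting the difference between the two.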
We’ll discuss this more later in the book, but for now, recognize that the previous example created a neural network that can learn the entire streetlights dataset by training on each example, one at a time.