Deep Learning Training Tricks

As neural networks become deeper, training them becomes more and more challenging. One major problem is so-called vanishing gradients or exploding gradients. This post gives a good introduction to those problems.

To make training deep networks more efficient, there are a few techniques that can be used.

Keep values in a reasonable range

To keep numerical computations stable, we want all values inside the neural network to stay in a reasonable range, typically [-1..1] or [0..1]. This is not a strict requirement, but because of how floating point arithmetic works, values of very different magnitudes cannot be combined accurately. For example, adding 10^-10 and 10^10 most likely gives 10^10, because the smaller value is lost to rounding.
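The rounding effect described above can be checked directly. This is a minimal sketch using Python's built-in float (64-bit floating point); the variable names are only illustrative.

```python
big = 1e10
small = 1e-10

# A 64-bit float keeps roughly 15-17 significant decimal digits, so the
# small term falls below the precision of the big one and is lost.
absorbed = (big + small) == big

# Values of similar magnitude add without this loss.
kept = (0.5 + 0.25) == 0.75

print(absorbed, kept)
```

Lower-precision types such as 32-bit floats, which are common in deep learning, lose small terms even sooner.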

Most activation functions behave best for inputs around [-1..1], so it makes sense to scale all input data to the [-1..1] or [0..1] interval.
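One simple way to bring inputs into these intervals is min-max scaling. The sketch below is only an illustration: the helper names `scale_to_unit` and `scale_to_symmetric` are hypothetical, and the pixel values are made-up sample data.

```python
def scale_to_unit(values):
    """Linearly map values to [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scale_to_symmetric(values):
    """Linearly map values to [-1, 1]."""
    return [2.0 * u - 1.0 for u in scale_to_unit(values)]


pixels = [0, 64, 128, 255]          # e.g. 8-bit image intensities
unit = scale_to_unit(pixels)        # values in [0, 1]
sym = scale_to_symmetric(pixels)    # values in [-1, 1]
print(unit, sym)
```

In practice the minimum and maximum would be computed on the training set and reused for validation and test data.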

Weight Initialization

Ideally, values should stay in the same range after passing through the network layers. It is therefore important to initialize weights in a way that preserves the distribution of values.

A normal distribution N(0,1) is not a good choice: with n inputs of unit variance, the variance of the output is about n (standard deviation √n), so values are likely to jump out of the [0..1] interval.

The following initializations are often used:

Uniform distribution -- uniform
N(0,1/n) -- gaussian
N(0,1/√n_in) guarantees that for inputs with zero mean and standard deviation 1, the output keeps the same mean and standard deviation
N(0,√(2/(n_in+n_out))) -- so-called Xavier initialization (glorot); it helps keep signals in range during both forward and backward propagation
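The effect of the weight scale can be estimated with a small simulation. The sketch below assumes a single linear unit with independent inputs of zero mean and unit variance; `output_std` and `xavier_std` are hypothetical helpers, and the second argument of N(·,·) is read as a standard deviation.

```python
import math
import random


def output_std(n_in, weight_std, trials=2000, seed=0):
    """Estimate the standard deviation of sum(w_i * x_i) by sampling."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        w = [rng.gauss(0.0, weight_std) for _ in range(n_in)]
        outputs.append(sum(wi * xi for wi, xi in zip(w, x)))
    mean = sum(outputs) / trials
    return math.sqrt(sum((o - mean) ** 2 for o in outputs) / trials)


def xavier_std(n_in, n_out):
    """Standard deviation used by Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (n_in + n_out))


n_in = 100
naive = output_std(n_in, 1.0)                       # grows like sqrt(n_in)
scaled = output_std(n_in, 1.0 / math.sqrt(n_in))    # stays close to 1
print(naive, scaled, xavier_std(100, 100))
```

With 100 inputs, the naive variant should give outputs with a spread of roughly 10, while the scaled variant keeps the spread close to 1.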

Batch Normalization

Even with proper weight initialization, weights can become arbitrarily large or small during training and push signals out of the proper range. We can bring signals back using one of the normalization techniques. There are several of them (Weight Normalization, Layer Normalization), but the most widely used is Batch Normalization.

The idea of batch normalization is to take all the values across the minibatch and normalize them (subtract the mean and divide by the standard deviation). It is implemented as a network layer that performs this normalization after the weights are applied but before the activation function. As a result, we are likely to see faster training and higher final accuracy.
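The normalization step itself can be sketched in a few lines. This is a simplified illustration only: it omits the learnable scale and shift parameters and the running statistics that batch normalization layers in real frameworks use at inference time, and the small constant `eps` is an assumption added to avoid division by zero.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) across the minibatch (rows)."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(n_features)]
        for row in batch
    ]


# Two features on very different scales, three samples in the minibatch.
pre_activations = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_normalize(pre_activations)
print(normalized)
```

After normalization both features have approximately zero mean and unit standard deviation, regardless of their original scale.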

See the original paper on batch normalization, the explanation on Wikipedia, and a good introductory blog post (and one in Russian).

Dropout

Dropout is a technique that removes a certain percentage of random neurons during training. It is implemented as a layer with one parameter (the percentage of neurons to remove, typically 10%-50%), and during training it sets random elements of the input vector to zero before passing it to the next layer.
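A dropout layer of this kind can be sketched as follows. The function name and signature are hypothetical. The sketch also rescales the surviving values by 1/(1 - rate), the "inverted dropout" convention used by common frameworks; that rescaling is an assumption added here and is not described in the text above.

```python
import random


def dropout(vector, rate, training=True, rng=random):
    """Zero each element with probability `rate` during training."""
    if not training or rate == 0.0:
        # At inference time the layer passes values through unchanged.
        return list(vector)
    keep = 1.0 - rate
    # Surviving values are scaled up so the expected sum stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.3, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(zero_fraction)
```

With a rate of 0.3, about 30% of the elements end up as zero on each call, and a different random subset is removed every time.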

Although this may sound like a strange idea, you can see the effect of dropout on training an MNIST digit classifier in the Dropout.ipynb notebook. It speeds up training and lets us reach higher accuracy in fewer training epochs.

This effect can be explained in several ways:

It can be seen as a random shock to the model, which helps the optimization escape a local minimum
It can be seen as implicit model averaging, because during dropout we are effectively training many slightly different models

Some people say that a drunk person who tries to learn something will remember it better the next morning than a sober person, because a brain with some malfunctioning neurons tries harder to adapt. We have never tested whether this is true.

How to stop overfitting

One very important aspect of deep learning is being able to prevent overfitting. Although it is tempting to use a very powerful neural network model, we should always balance the number of model parameters against the number of training samples.

Make sure you understand the concept of overfitting that we introduced earlier!

Ways to stop overfitting:

Early stopping -- monitor the error on the validation set and stop training when the validation error starts to increase.
Explicit Weight Decay / Regularization -- add an extra penalty to the loss function for weights with large absolute values, which keeps the model from producing very unstable results.
Model Averaging -- train several models and average the result. This helps reduce variance.
Dropout (Implicit Model Averaging)
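Of the techniques listed above, early stopping is the easiest to sketch in isolation. The example below assumes that validation errors are recorded once per epoch. The `patience` parameter (how many non-improving epochs to tolerate) is a common convention rather than something prescribed here, and the error values are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            # Validation error improved: remember it and reset the counter.
            best = err
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    # Never triggered: training runs through the last epoch.
    return len(val_errors) - 1


history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(history, patience=2)
print(stop_at)
```

In a real training loop one would usually also keep a copy of the weights from the best epoch and restore them after stopping.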

Optimizers / Training Algorithms

Another important aspect of training is choosing a good training algorithm. Classical gradient descent is a reasonable choice, but it can sometimes be too slow or cause other problems.

In deep learning we use Stochastic Gradient Descent (SGD), which is gradient descent applied to minibatches drawn at random from the training set. Weights are adjusted using this formula:

w^{t+1} = w^t - η∇ℒ

Momentum

In momentum SGD, we keep a portion of the gradient from previous steps. It is similar to moving with inertia: when we receive a push in a different direction, our trajectory does not change immediately but keeps part of the original movement. We introduce another vector v to represent the speed:

v^{t+1} = γ v^t - η∇ℒ
w^{t+1} = w^t + v^{t+1}

The parameter γ indicates how much inertia is taken into account: γ=0 corresponds to classical SGD; γ=1 is a pure motion equation, where the accumulated velocity never decays.
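The two update equations translate almost line by line into code. The sketch below minimizes a hypothetical one-dimensional loss (w - 3)^2, chosen only for illustration, and uses the full gradient rather than a minibatch estimate. The learning rate, step count, and function names are assumptions.

```python
def momentum_sgd(grad, w0, lr=0.1, gamma=0.9, steps=200):
    """Apply v <- gamma*v - lr*grad(w), then w <- w + v, for `steps` steps."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = gamma * v - lr * grad(w)
        w = w + v
    return w


def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_plain = momentum_sgd(grad, w0=0.0, gamma=0.0)     # classical gradient step
w_momentum = momentum_sgd(grad, w0=0.0, gamma=0.9)  # with inertia
print(w_plain, w_momentum)
```

Both runs approach the minimum at w = 3. With γ=0 the velocity term vanishes and the update reduces to the plain gradient step.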

Adam, Adagrad, etc.

In each layer, signals are multiplied by a matrix W_i. Depending on ||W_i||, the gradient can either shrink towards 0 or grow without bound. This is the essence of the exploding/vanishing gradients problem.

One solution to this problem is to use only the direction of the gradient in the update and ignore its absolute value:

w^{t+1} = w^t - η(∇ℒ/||∇ℒ||), where ||∇ℒ|| = √(∑(∇ℒ)²)

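Adagrad builds on this idea: it divides each parameter's step by the square root of the accumulated sum of that parameter's past squared gradients, rather than by the norm of the current gradient alone. Other algorithms based on a similar idea: RMSProp, Adam.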

Adam is considered a very efficient algorithm for many applications, so if you are not sure which one to use, use Adam.
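The sketch below contrasts the normalized-gradient step from the formula above with a basic Adagrad step. Both functions are hypothetical helpers written for illustration. The Adagrad variant follows the usual textbook form, with a per-parameter accumulator of squared gradients and a small `eps` for numerical safety, and is not taken from any particular library.

```python
import math


def normalized_step(w, g, lr):
    """Move by lr in the direction of -g, ignoring the size of g."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)
    return [wi - lr * gi / norm for wi, gi in zip(w, g)]


def adagrad_step(w, g, accum, lr, eps=1e-8):
    """One Adagrad update; `accum` holds per-parameter sums of g^2."""
    accum = [a + gi * gi for a, gi in zip(accum, g)]
    w = [wi - lr * gi / (math.sqrt(a) + eps)
         for wi, gi, a in zip(w, g, accum)]
    return w, accum


w = [0.0, 0.0]
step_huge = normalized_step(w, [3000.0, 4000.0], lr=0.1)
step_tiny = normalized_step(w, [3e-6, 4e-6], lr=0.1)
w_ada, accum = adagrad_step(w, [3000.0, 4000.0], [0.0, 0.0], lr=0.1)
print(step_huge, step_tiny, w_ada)
```

A very large and a very small gradient pointing in the same direction produce the same normalized step, which is what protects the update from exploding or vanishing gradients.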

Gradient clipping

Gradient clipping is an extension of the idea above. When ||∇ℒ|| ≤ θ, we use the original gradient in the weight update; when ||∇ℒ|| > θ, we divide the gradient by its norm. Here θ is a parameter; in most cases θ=1 or θ=10 works well.
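A minimal sketch of clipping by norm follows. Note the assumption: when the threshold is exceeded, this version rescales the gradient so that its norm equals θ, which is the common convention. For θ=1 this is the same as dividing the gradient by its norm. The function name is hypothetical.

```python
import math


def clip_by_norm(g, theta):
    """Return g unchanged if ||g|| <= theta, else rescale it to norm theta."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm <= theta:
        return list(g)
    return [gi * theta / norm for gi in g]


small_grad = clip_by_norm([0.3, 0.4], theta=1.0)    # norm 0.5, untouched
large_grad = clip_by_norm([30.0, 40.0], theta=1.0)  # norm 50, rescaled
print(small_grad, large_grad)
```

The direction of the gradient is preserved in both cases; only the length of an oversized gradient changes.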

Learning rate decay

Training success often depends on the learning rate parameter η. It is reasonable to assume that larger values of η give faster training, which is what we want at the beginning, while smaller values of η allow us to fine-tune the network later. Thus, in most cases we want to decrease η during training.

This can be done by multiplying η by some number slightly below 1 (e.g. 0.98) after each epoch, or by using a more complicated learning rate schedule.
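The multiplicative rule above gives an exponential schedule, η_k = η_0 · 0.98^k after k epochs. A minimal sketch, with a hypothetical helper name and an assumed starting rate of 0.1:

```python
def exponential_decay(lr0, factor, epoch):
    """Learning rate after `epoch` epochs of multiplying by `factor`."""
    return lr0 * factor ** epoch


# Learning rate for each of the first 50 epochs.
schedule = [exponential_decay(0.1, 0.98, e) for e in range(50)]
print(schedule[0], schedule[-1])
```

With a factor of 0.98 the rate falls to a bit over a third of its starting value after 50 epochs. Deep learning frameworks usually provide this and more elaborate schedules as ready-made components.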

Different Network Architectures

Choosing the right network architecture for your problem can be tricky. Normally, we would take an architecture that has proven to work for a similar task. Here is a good overview of neural network architectures for computer vision.

It is important to choose an architecture that is powerful enough for the number of training samples we have. Choosing a model that is too powerful can lead to overfitting.

Another good approach is to use an architecture that automatically adjusts to the required complexity. To some extent, the ResNet and Inception architectures are self-adjusting. More on computer vision architectures.

Disclaimer:
This document was translated using the AI translation service Co-op Translator. While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
