### The Rise of Deep Learning

In the span of a few years, deep learning has taken the world by storm and has established itself as a very powerful tool under many applications such as image classification, anomaly detection, natural language processing, and much more. This became possible especially through the emergence of deep neural networks, architecturally layered models that perform at high efficiency when given sufficient enough data.

### The Complicated Architecture of Neural Networks

Despite the rapid growth of neural networks, many new developers, including myself, still struggle in the process of constructing the network architecture. But why? Simply put, neural networks have a complicated architecture that requires proper manual configuration of specific hyper parameters. But before moving any forward, let’s quickly clarify what the difference between **parameters** and **hyper parameters** is.

#### Parameters vs. Hyper Parameters

A model’s parameters are variables that the model learns and continuously adjusts independently throughout the training stage. For instance, this can be the weights and biases of a linear regression model. Contrarily, a model’s hyper parameters are manually configured variables whose values are chosen by the developer. These hyper parameters are then fed into the network to affect the learning process followed by that network.

#### The Difficulty in Tuning Hyper Parameters

Tuning hyper parameters is a complicated, and sometimes stressful, process because one trivial mistake in the choice for a specific hyper parameter can produce an extremely poor model. In addition to that, this process often involves a lot of experimentation and multiple trial and error runs to produce an high performing model. This is often not only a guessing game for what the value should be for a specific hyper parameter, but instead what a set of values should be for a designated set of hyper parameters. For instance, developers may often find themselves asking questions of the type

If I increase the value of X, could the model perform better if I also increase the value of Y and decrease the value of Z?

This curiosity is essentially what drives the systematic experimentation step series, but it is also what allows the production of extremely efficient models.

#### Hyper Parameters Discussed in This Post

There’s quite a bunch of hyper parameters one can discuss, but for the sake of keeping this post short and simple, I will discuss the main ones that I think developers should understand well when constructing neural networks. The following is a list of the hyper parameters that will be explored in this post:

- Mini-Batch Size
- Loss Function Choice
- Learning Rate
- Number of Epochs
- Number of Hidden Layers and Neuron Count at Each Layer

#### Mini-Batch Size

Mini-batch gradient descent is similar to the stochastic gradient descent algorithm, except in the regard that it splits the training data into a set of batches, each having a size that is specified by the developer. These batches are then used in the learning step as the model’s parameters get updated based on the model’s errors on each batch. The purpose of mini-batch gradient descent is to reach a healthy balance between the accuracy that stochastic gradient descent outputs and the speed and efficiency that batch gradient descent provides.

Considering that mini-batch gradient descent is the algorithm strategy that is most frequently used in deep learning, it is important for one to understand how a specified batch size affects the model’s speed and accuracy.

##### Increasing Batch Size

The larger our batch size is, the closer our gradient descent becomes to batch gradient descent. Essentially, the largest batch size we could have is the size of the training data itself, which would split the training data into (yup you guessed it) 1 batch only. In that case, we would be performing batch gradient descent since we compute the model error for each entry in that single batch and only update the model’s parameters after going through the entire batch (i.e. the full training data in this case). Moreover, increasing the batch size results in a slower learning convergence but more accurate results.

##### Decreasing Batch Size

The lower our batch size is, the closer our gradient descent becomes to stochastic gradient descent. The smallest batch size we can potentially have is 1, which would split the training data into * n* batches, where

**is the length of the training data. In that case, we would be performing stochastic gradient descent since each batch is 1 entry in the training data, and for each of those batches, we compute the model error and immediately update the model’s parameters. Moreover, decreasing the batch size results in a faster learning convergence but less accurate results**

*n*##### Size Configuration Norm

Generally speaking, the default batch size tends to be set to 32 because that size works well quite often. However, if it is necessary to choose a batch size other than that, then the common strategy is to use values that work nicely with the architecture of the machine building the model. That is, numbers that are powers of 2 (i.e. 32, 64, 128, and so on) work really well since they fit directly with memory constraints of accelerator hardware such as GPUs or even the CPU itself.

#### Loss Function Choice

There’s a plethora of loss functions to choose from, but for this post, I will be focusing on a small specific subset of them.

##### Mean Squared Error (MSE), Quadratic Loss, L2 Loss

When it comes to regression, MSE is the most common loss function to be used. The actual loss output of this function is the summation of the squared distance between the predicted and target values.

$MSE = \cfrac{\sum_{j = 1}^{n} (y_{j} – y_{j}^{p})^{2}}{n}$

In the above function, $y_{j}$ is the target value at index $j$ while $y_{j}^{p}$ is the predicted value at index $j$.

The graph of the MSE function is parabolic as it is a polynomial function of degree 2.

##### Mean Absolute Error (MAE), L1 Loss

Another great loss function that’s commonly used for regression is MAE. This function is the summation of absolute differences between the predicted and target values.

$MAE = \cfrac{\sum_{j = 1}^{n} |y_{j} – y_{j}^{p}|}{n}$

##### MSE vs. MAE

Solving MSE is much easier than solving MAE. However. when the dataset contains a lot of outliers, MAE tends to be the more robust option. This is because larger errors tend to rapidly increase the MSE loss output in comparison to the MAE loss because of the squaring effect. Hence, with outliers being really far away from predicted values, a model using MSE loss will give more weight to the outliers, which will essentially skew the fitted model. Hence, if the training data is such that it contains a lot of outliers, it is often better to instead use the MAE loss as that handles the outliers much better than MSE.

Unfortunately, MAE comes with its own fair share of problems. One of the main problems of MAE is that its gradient remains constant the entire time, which means the gradient will remain large even when our model is very close to the global minima, which will cause our model to overshoot the global minima if we don’t take precautions. Hence, the typical solution to this issue is to dynamically adjust the learning rate and lower it gradually throughout training.

##### Huber, Smooth L1 Loss

I like to think of the Huber loss function as a balance between MSE and MAE. The advantage Huber loss has over MSE is that it’s less sensitive to outliers. The advantage it has over MAE is that it’s actually differentiable at 0, meaning that the gradient is no longer constant. The main drawback of Huber loss is that its sensitivity to outliers is tweaked iteratively through the use of yet another hyper parameter $\delta$, which can sometimes be time consuming.

#### Learning Rate

The learning rate is the hyper parameter that gets used by our model during gradient descent to update the model’s parameters. When updating the model’s parameters, the general rule of thumb is to move against the direction of the gradient. But how many units do we move? That’s where the learning rate comes into play. The learning rate is multiplied by the negative gradient and that is how many units we adjust the model’s parameters by. The goal is to eventually arrive at model parameter values that minimize the output of our loss function.

Suppose we predicted the output for a batch of data and then computed our loss value. If the slope of the loss function at that point is positive, that means the loss function is increasing at that point. This implies that to reduce the loss, we need to decrease the values of our model’s parameters. Similarly, if the slope of the loss function at that point is negative, that means the loss function is decreasing there. This implies that to continue reducing the loss, we need to increase the values of our model’s parameters. This is the backbone logic behind the gradient descent update rule that is shown below:

$\theta_{j} := \theta_{j} – \alpha \nabla_{\theta_{j}} L(\theta_{j})$

##### Increasing the Learning Rate

The larger our learning rate is, the faster our model is able to learn. This is because the adjustment to the model’s parameters is directly proportional to the learning rate. That is, if the learning rate increases, the model’s parameters are adjusted by a larger factor. A high learning rate is beneficial in the early stages of training as it allows our model to get close to the global minima of our chosen loss function really fast. It does, however, come with a drawback. When we approach the global minima of the loss curve, if our learning rate is too high, we might overshoot and completely skip over the global minima.

##### Decreasing the Learning Rate

The smaller our learning rate is, the slower our model learns. As explained earlier, the adjustment of model parameters is directly proportional to the learning rate. Hence, if the learning rate decreases, then the model’s parameters are adjusted by a smaller factor. A low learning rate is beneficial in the late stages of training as it allows our model to get really close to the global minima of the loss function without running much risk of overshooting the global minima. However, as with a high learning rate, it also comes with its own drawback. If we’re at a local minima that is far from being the optimum solution, then a low learning rate may force our model to undershoot and essentially keep it stuck at that local minima instead of allowing it to discover the global minima.

##### General Strategy

Usually, developers will take the following approach when it comes to choosing learning rates:

- Initiate training with a high learning rate to decrease the loss and approach global minima really fast
- Seal training with a low learning rate to continue decreasing loss without overshooting the global minima

#### Number of Epochs

An epoch terminates once the full dataset has been passed forward and backward through the neural network exactly once. If the dataset is too large and cannot possibly be loaded entirely in memory, then it must be split into mini batches. After each epoch, the model’s parameters are updated accordingly. Initially, after just the first epoch, our model starts by underfitting the data (i.e. it hasn’t learned much about the data). Over time through a series of epochs, our model gradually approaches an optimal approximation of the data. Once an optimal solution is reached, running more epochs will overfit the data, meaning that our model will learn too much specificity about the training data that it can’t generalize predictions to new, unseen data.

##### So how many epochs do we need?

To be completely honest, there’s no clear cut number that’ll always work. The right number varies across datasets. This is more of a trial and error problem that you manually solve by testing multiple epoch numbers to see what works best. As long as you don’t reach a point of overfitting or where your validation loss halts, then the number of epochs you have should be fine.

#### Number of Hidden Layers and Neuron Count at Each Layer

To start off, if your dataset is linearly separable, then you don’t need any hidden layers. From a performance perspective, it is commonly understood that the addition of second, third, and further layers improves the performance in very few cases. Generally speaking, one hidden layer is good enough for most problems. However, choosing the number of hidden layers is only part of the problem. The remaining part, deciding how many neurons you want each layer to have, is very critical because an inappropriate value will result in underfitting or overfitting the data.

##### Very Few Neurons

If your hidden layers contain an insufficient amount of neurons, then the model will underfit the data. This results because there are very little amount of neurons to be able to detect the complicated patterns present in the data.

##### Too Many Neurons

If your hidden layers contain an overly saturated amount of neurons, then the model will overfit the data. This results because the extremely large amount of neurons provides the model with too much processing capacity such that the small amount of data is incapable of fully training all neurons in the hidden layer.

##### So what makes a good number of neurons?

Well, similar to the epoch count, there really is no magic number here. There is a set of general rules of thumb for choosing a solid number of hidden layer neurons, such as the following:

- The number of neurons at the hidden layer should be between the sizes of the input and output layers.
- The number of neurons should be roughly $\cfrac{2}{3}$(input layer size) + (output layer size)
- The number of neurons should be less than 2 $\times$ (input layer size)

These three guidelines are generally a good starting position and certainly provide developers with a flexible number of options to experiment with manually.