
Hyper Parameter Tuning… What’s That?

Derived from here

The Rise of Deep Learning

In the span of a few years, deep learning has taken the world by storm and established itself as a very powerful tool across many applications such as image classification, anomaly detection, natural language processing, and much more. This became possible especially through the emergence of deep neural networks, architecturally layered models that perform at high efficiency when given sufficient data.

The Complicated Architecture of Neural Networks

Despite the rapid growth of neural networks, many new developers, including myself, still struggle with constructing the network architecture. But why? Simply put, neural networks have a complicated architecture that requires proper manual configuration of specific hyper parameters. But before moving forward, let’s quickly clarify the difference between parameters and hyper parameters.

Parameters vs. Hyper Parameters

A model’s parameters are variables that the model learns and continuously adjusts on its own throughout the training stage, for instance, the weights and biases of a linear regression model. In contrast, a model’s hyper parameters are manually configured variables whose values are chosen by the developer. These hyper parameters are then fed into the network and affect the learning process it follows.

The Difficulty in Tuning Hyper Parameters

Tuning hyper parameters is a complicated, and sometimes stressful, process because one trivial mistake in the choice of a specific hyper parameter can produce an extremely poor model. In addition, this process often involves a lot of experimentation and multiple trial-and-error runs to produce a high-performing model. It is often not only a guessing game for what the value should be for a specific hyper parameter, but rather what a set of values should be for a designated set of hyper parameters. For instance, developers may often find themselves asking questions of the type:

If I increase the value of X, could the model perform better if I also increase the value of Y and decrease the value of Z?

This curiosity is essentially what drives the systematic experimentation, but it is also what allows the production of extremely efficient models.

Hyper Parameters Discussed in This Post

There’s quite a bunch of hyper parameters one can discuss, but for the sake of keeping this post short and simple, I will discuss the main ones that I think developers should understand well when constructing neural networks. The following is a list of the hyper parameters that will be explored in this post:

  1. Mini-Batch Size
  2. Loss Function Choice
  3. Learning Rate
  4. Number of Epochs
  5. Number of Hidden Layers and Neuron Count at Each Layer

Mini-Batch Size

Mini-batch gradient descent is similar to the stochastic gradient descent algorithm, except that it splits the training data into a set of batches, each of a size specified by the developer. These batches are then used in the learning step, as the model’s parameters get updated based on the model’s error on each batch. The purpose of mini-batch gradient descent is to reach a healthy balance between the speed that stochastic gradient descent provides and the accuracy that batch gradient descent outputs.

Considering that mini-batch gradient descent is the algorithm strategy that is most frequently used in deep learning, it is important for one to understand how a specified batch size affects the model’s speed and accuracy.

Increasing Batch Size

The larger our batch size is, the closer our gradient descent becomes to batch gradient descent. Essentially, the largest batch size we could have is the size of the training data itself, which would split the training data into (yup you guessed it) 1 batch only.  In that case, we would be performing batch gradient descent since we compute the model error for each entry in that single batch and only update the model’s parameters after going through the entire batch (i.e. the full training data in this case). Moreover, increasing the batch size results in a slower learning convergence but more accurate results.

Decreasing Batch Size

The lower our batch size is, the closer our gradient descent becomes to stochastic gradient descent. The smallest batch size we can potentially have is 1, which would split the training data into n batches, where n is the length of the training data. In that case, we would be performing stochastic gradient descent since each batch is 1 entry in the training data, and for each of those batches, we compute the model error and immediately update the model’s parameters. Moreover, decreasing the batch size results in a faster learning convergence but less accurate results.

Size Configuration Norm

Generally speaking, the default batch size tends to be set to 32 because that size works well quite often. However, if it is necessary to choose a batch size other than that, then the common strategy is to use values that work nicely with the architecture of the machine building the model. That is, numbers that are powers of 2 (i.e. 32, 64, 128, and so on) work really well since they fit directly with memory constraints of accelerator hardware such as GPUs or even the CPU itself.
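To make this concrete, here is a minimal sketch of mini-batch gradient descent on a hypothetical linear regression problem (plain NumPy, synthetic data, and the variable names are my own); note that the parameters are updated once per batch rather than once per full pass over the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + small noise
X = rng.normal(size=(256, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=256)

w, b = 0.0, 0.0
batch_size = 32        # the hyper parameter under discussion
learning_rate = 0.1

for epoch in range(20):
    # Shuffle once per epoch, then walk through the data batch by batch
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = w * xb + b - yb
        # Gradient of the MSE loss computed on this batch only
        w -= learning_rate * 2 * np.mean(err * xb)
        b -= learning_rate * 2 * np.mean(err)
```

Setting `batch_size = len(X)` here recovers batch gradient descent, while `batch_size = 1` recovers stochastic gradient descent.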

Loss Function Choice

There’s a plethora of loss functions to choose from, but for this post, I will be focusing on a small specific subset of them.

Mean Squared Error (MSE), Quadratic Loss, L2 Loss

When it comes to regression, MSE is the most commonly used loss function. The actual loss output of this function is the mean of the squared differences between the predicted and target values.

$MSE = \cfrac{\sum_{j = 1}^{n} (y_{j} - y_{j}^{p})^{2}}{n}$

In the above function, $y_{j}$ is the target value at index $j$ while $y_{j}^{p}$ is the predicted value at index $j$.

The graph of the MSE function is parabolic as it is a polynomial function of degree 2.
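As a quick sanity check, the formula above can be transcribed directly (the values below are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # target values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # predicted values

# Mean of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```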

Mean Absolute Error (MAE), L1 Loss

Another great loss function that’s commonly used for regression is MAE. This function computes the mean of the absolute differences between the predicted and target values.

$MAE = \cfrac{\sum_{j = 1}^{n} |y_{j} - y_{j}^{p}|}{n}$


Solving MSE is much easier than solving MAE. However, when the dataset contains a lot of outliers, MAE tends to be the more robust option. This is because larger errors rapidly inflate the MSE loss relative to the MAE loss due to the squaring effect. With outliers lying far from the predicted values, a model using MSE loss gives more weight to the outliers, which skews the fitted model. Hence, if the training data contains a lot of outliers, it is often better to use the MAE loss instead, as it handles outliers much better than MSE.
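A small made-up example illustrates the point: one outlier barely moves the MAE but blows up the MSE.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 100.0])  # 100.0 is an outlier
y_pred = np.array([1.5, 2.5, 2.5, 4.0])

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
print(mae)  # 24.375
print(mse)  # 2304.1875 -- dominated almost entirely by the squared outlier error
```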

Unfortunately, MAE comes with its own fair share of problems. One of the main problems of MAE is that its gradient remains constant the entire time, which means the gradient will remain large even when our model is very close to the global minima, which will cause our model to overshoot the global minima if we don’t take precautions. Hence, the typical solution to this issue is to dynamically adjust the learning rate and lower it gradually throughout training.

Huber, Smooth L1 Loss

I like to think of the Huber loss function as a balance between MSE and MAE. The advantage Huber loss has over MSE is that it’s less sensitive to outliers. The advantage it has over MAE is that it’s actually differentiable at 0, meaning that the gradient is no longer constant. The main drawback of Huber loss is that its sensitivity to outliers is tweaked iteratively through the use of yet another hyper parameter $\delta$, which can sometimes be time consuming.
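A minimal sketch of the Huber loss (the function name and values are my own; $\delta$ controls where the loss transitions from the quadratic to the linear regime):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for residuals within delta, linear beyond it
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quadratic, linear))

print(huber(np.array([0.0]), np.array([0.5])))  # 0.125 (quadratic regime)
print(huber(np.array([0.0]), np.array([3.0])))  # 2.5   (linear regime)
```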

Learning Rate

The learning rate is the hyper parameter that gets used by our model during gradient descent to update the model’s parameters. When updating the model’s parameters, the general rule of thumb is to move against the direction of the gradient. But how many units do we move? That’s where the learning rate comes into play. The learning rate is multiplied by the negative gradient and that is how many units we adjust the model’s parameters by. The goal is to eventually arrive at model parameter values that minimize the output of our loss function.

Suppose we predicted the output for a batch of data and then computed our loss value. If the slope of the loss function at that point is positive, that means the loss function is increasing at that point. This implies that to reduce the loss, we need to decrease the values of our model’s parameters. Similarly, if the slope of the loss function at that point is negative, that means the loss function is decreasing there. This implies that to continue reducing the loss, we need to increase the values of our model’s parameters. This is the backbone logic behind the gradient descent update rule that is shown below:

$\theta_{j} := \theta_{j} - \alpha \nabla_{\theta_{j}} L(\theta_{j})$
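To see the update rule in action on the simplest possible loss, here is a toy example of my own with $L(\theta) = \theta^{2}$, whose gradient is $2\theta$; repeated updates walk $\theta$ down toward the minimum at 0:

```python
theta = 4.0   # initial parameter value
alpha = 0.1   # learning rate

for _ in range(50):
    grad = 2 * theta           # gradient of L(theta) = theta**2
    theta = theta - alpha * grad

print(theta)  # very close to 0
```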

Increasing the Learning Rate

The larger our learning rate is, the faster our model is able to learn. This is because the adjustment to the model’s parameters is directly proportional to the learning rate. That is, if the learning rate increases, the model’s parameters are adjusted by a larger factor. A high learning rate is beneficial in the early stages of training as it allows our model to get close to the global minima of our chosen loss function really fast. It does, however, come with a drawback. When we approach the global minima of the loss curve, if our learning rate is too high, we might overshoot and completely skip over the global minima.

Decreasing the Learning Rate

The smaller our learning rate is, the slower our model learns. As explained earlier, the adjustment of model parameters is directly proportional to the learning rate. Hence, if the learning rate decreases, then the model’s parameters are adjusted by a smaller factor. A low learning rate is beneficial in the late stages of training as it allows our model to get really close to the global minima of the loss function without running much risk of overshooting the global minima. However, as with a high learning rate, it also comes with its own drawback. If we’re at a local minima that is far from being the optimum solution, then a low learning rate may force our model to undershoot and essentially keep it stuck at that local minima instead of allowing it to discover the global minima.

General Strategy

Usually, developers will take the following approach when it comes to choosing learning rates:

  • Initiate training with a high learning rate to decrease the loss and approach global minima really fast
  • Seal training with a low learning rate to continue decreasing loss without overshooting the global minima
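One common way to realize this strategy is an exponential decay schedule (the constants below are arbitrary choices for illustration):

```python
initial_lr = 0.1   # high rate for fast early progress
decay = 0.95       # multiplicative decay applied each epoch

def lr_at(epoch):
    # Learning rate used at a given epoch under exponential decay
    return initial_lr * decay ** epoch

print(lr_at(0))    # 0.1
print(lr_at(100))  # roughly 0.0006 -- fine-grained steps near the minimum
```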

Number of Epochs

An epoch terminates once the full dataset has been passed forward and backward through the neural network exactly once. If the dataset is too large to be loaded entirely into memory, then it must be split into mini batches, and the model’s parameters are updated as those batches pass through the network. Initially, after just the first epoch, our model underfits the data (i.e. it hasn’t learned much about the data yet). Over a series of epochs, our model gradually approaches an optimal approximation of the data. Once an optimal solution is reached, running more epochs will overfit the data, meaning that our model learns so much specificity about the training data that it can’t generalize its predictions to new, unseen data.

So how many epochs do we need?

To be completely honest, there’s no clear-cut number that’ll always work. The right number varies across datasets. This is more of a trial-and-error problem that you manually solve by testing multiple epoch counts to see what works best. As long as you don’t reach a point of overfitting or a point where your validation loss stops improving, the number of epochs you have should be fine.
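This trial-and-error process is often automated with early stopping: halt once the validation loss stops improving for a set number of epochs. A minimal sketch (the function, patience value, and loss numbers here are all hypothetical):

```python
def epochs_until_stop(val_losses, patience=3):
    # Stop once validation loss hasn't improved for `patience` epochs.
    # In a real training loop each value would come from evaluating
    # the model on the validation set after an epoch of training.
    best = float("inf")
    stale = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1

# Validation loss bottoms out at epoch 3, so training halts at epoch 6
print(epochs_until_stop([1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.38]))  # 6
```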

Number of Hidden Layers and Neuron Count at Each Layer

To start off, if your dataset is linearly separable, then you don’t need any hidden layers at all. From a performance perspective, it is commonly understood that adding a second, third, or further layer improves performance in very few cases. Generally speaking, one hidden layer is good enough for most problems. However, choosing the number of hidden layers is only part of the problem. The remaining part, deciding how many neurons each layer should have, is critical because an inappropriate value will result in underfitting or overfitting the data.

Very Few Neurons

If your hidden layers contain too few neurons, then the model will underfit the data. This happens because there are too few neurons to detect the complicated patterns present in the data.

Too Many Neurons

If your hidden layers contain far too many neurons, then the model will overfit the data. This happens because the excessive number of neurons gives the model so much processing capacity that the limited amount of data cannot fully train every neuron in the hidden layer.

So what makes a good number of neurons?

Well, similar to the epoch count, there really is no magic number here. There is a set of general rules of thumb for choosing a solid number of hidden layer neurons, such as the following:

  • The number of neurons at the hidden layer should be between the sizes of the input and output layers.
  • The number of neurons should be roughly $\cfrac{2}{3}$(input layer size) + (output layer size)
  • The number of neurons should be less than 2 $\times$ (input layer size)

These three guidelines are generally a good starting position and certainly provide developers with a flexible number of options to experiment with manually.
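For a hypothetical network with 10 input features and 1 output, the three rules of thumb evaluate as follows (a sketch, with my own variable names):

```python
input_size, output_size = 10, 1   # hypothetical layer sizes

# Rule 1: neuron count should fall between the input and output sizes
low, high = min(input_size, output_size), max(input_size, output_size)

# Rule 2: roughly 2/3 of the input size plus the output size
rule_of_thumb = round(2 / 3 * input_size + output_size)

# Rule 3: stay under twice the input size
upper_cap = 2 * input_size

print(low, high)      # 1 10
print(rule_of_thumb)  # 8
print(upper_cap)      # 20
```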


Predicting Suicide Rates Using Linear Regression

Derived from here

Hi everyone! I’m taking an online deep learning with PyTorch course, which has turned out to be a really enjoyable experience. For our second assignment, our core focus was on building a Linear Regression model that predicts insurance charges. I finished that assignment, but to spice things up a bit, I decided to use a different dataset to build a similar model that instead predicts suicide rates. Since the objective of the assignment was to fully understand Linear Regression and how to build a model of that type, I stuck with that for this optional project as well. The Jupyter Notebook I’ve written can be found below as well as on GitHub here

Predicting Suicide Rates Using Linear Regression

In this notebook, we’ll be using information such as country name, gender, age group, population count, GDP values, and other features to predict the suicide rate for a specific group of people in any given year. This kind of model could be useful to suicide prevention and mental health hotline organizations, both for identifying growth trends in this dilemma and for monitoring their progress toward reducing suicide rates. The dataset for this problem is taken from:

We will create a model with the following steps:

  1. Downloading and Cleaning the Data
  2. Exploring the Data
  3. Preparing the Dataset for Training
  4. Creating a Linear Regression Model
  5. Training the Model to Fit the Data
  6. Evaluating the Model on the Test Data
In [1]:
!pip install jovian --upgrade --quiet
In [2]:
import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
In [3]:
# Configuring styles
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
In [4]:
project_name='02-custom-sucide-linear-regression' # will be used by jovian.commit

Step 1: Downloading and Cleaning the Data

To load the dataset into memory, we’ll make use of the read_csv function from the Pandas library. We will then work with the data as a Pandas dataframe.

In [5]:
dataframe = pd.read_csv('/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv')

To clean the data, we first start by removing unnecessary columns such as suicides_no, country-year, and generation since these columns are simple transformations of other provided columns and do not provide any new information. We then relabel the gdp_for_year (\$) column since the original dataset has the string with extra whitespace. Lastly, we convert the gdp_for_year (\$) column to an integer by stripping away the commas in the string and then casting the values to type integer.

In [6]:
dataframe = dataframe.drop(['suicides_no', 'country-year', 'generation'], axis=1) # dropping duplicate columns that aren't necessary
dataframe.rename(columns = {' gdp_for_year ($) ':'gdp_for_year ($)'}, inplace = True) # renaming column to remove extra whitespace
dataframe['gdp_for_year ($)'] = dataframe['gdp_for_year ($)'].str.replace(',', '').astype('int') # converting data type to integer
dataframe = dataframe.fillna(dataframe.mean())

Step 2: Exploring the Data

Let’s first see what our data looks like as a starting step.

In [7]:
country year sex age population suicides/100k pop HDI for year gdp_for_year ($) gdp_per_capita ($) generation
0 Albania 1987 male 15-24 years 312900 6.71 0.776601 2156624900 796 Generation X
1 Albania 1987 male 35-54 years 308000 5.19 0.776601 2156624900 796 Silent
2 Albania 1987 female 15-24 years 289700 4.83 0.776601 2156624900 796 Generation X
3 Albania 1987 male 75+ years 21800 4.59 0.776601 2156624900 796 G.I. Generation
4 Albania 1987 male 25-34 years 274300 3.28 0.776601 2156624900 796 Boomers

Let’s also see what data types we’re working with to confirm that everything’s fine.

In [8]:
country                object
year                    int64
sex                    object
age                    object
population              int64
suicides/100k pop     float64
HDI for year          float64
gdp_for_year ($)        int64
gdp_per_capita ($)      int64
generation             object
dtype: object

Now, let’s answer some basic questions about our dataset.

Q: How many rows are there in the dataset?

In [9]:
num_rows = dataframe.shape[0]

Q: How many columns are there in the dataset?

In [10]:
num_cols = dataframe.shape[1]

Q: What are the names of our input features?

In [11]:
input_cols = [col for col in dataframe.columns]
input_cols.remove('suicides/100k pop')
input_cols
['country',
 'year',
 'sex',
 'age',
 'population',
 'HDI for year',
 'gdp_for_year ($)',
 'gdp_per_capita ($)',
 'generation']

Q: What are the names of our categorical features?

In [12]:
categorical_cols = [col for col in dataframe.select_dtypes(exclude=['number']).columns]
['country', 'sex', 'age', 'generation']

Q: What is the label that we are predicting?

In [13]:
output_cols = ['suicides/100k pop']
['suicides/100k pop']

Q: What does the distribution of label values look like?

In [14]:
plt.title("Distribution of Suicides per 100k population")
sns.distplot(dataframe['suicides/100k pop']);

Q: What are the minimum, average, and maximum suicide rates?

In [15]:
print("Minimum Suicides per 100k Population: {}".format(dataframe['suicides/100k pop'].min()))
print("Maximum Suicides per 100k Population: {}".format(dataframe['suicides/100k pop'].max()))
print("Average Suicides per 100k Population: {}".format(dataframe['suicides/100k pop'].mean()))
Minimum Suicides per 100k Population: 0.0
Maximum Suicides per 100k Population: 224.97
Average Suicides per 100k Population: 12.816097411933894
In [45]:
jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Detected Kaggle notebook...
[jovian] Uploading notebook to

Step 3: Preparing the Dataset for Training

We’ll first begin by converting our Pandas dataframe to a series of numpy arrays. The numeric features can be converted directly. The categorical features, however, need to be transformed into an appropriate numeric representation. We can do this by converting each categorical feature into dummy binary indicator variables that specify which of that feature’s fixed set of values is present in a given row.

In [17]:
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    temp = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        temp[col] = pd.get_dummies(temp[col])
    # Extract input & outputs as numpy arrays
    inputs_array = temp[input_cols].to_numpy()
    targets_array = temp[output_cols].to_numpy()
    return inputs_array, targets_array
In [18]:
inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array
(array([[1.00000000e+00, 1.98700000e+03, 0.00000000e+00, ...,
         2.15662490e+09, 7.96000000e+02, 0.00000000e+00],
        [1.00000000e+00, 1.98700000e+03, 0.00000000e+00, ...,
         2.15662490e+09, 7.96000000e+02, 0.00000000e+00],
        [1.00000000e+00, 1.98700000e+03, 1.00000000e+00, ...,
         2.15662490e+09, 7.96000000e+02, 0.00000000e+00],
        [0.00000000e+00, 2.01400000e+03, 0.00000000e+00, ...,
         6.30670772e+10, 2.30900000e+03, 0.00000000e+00],
        [0.00000000e+00, 2.01400000e+03, 1.00000000e+00, ...,
         6.30670772e+10, 2.30900000e+03, 0.00000000e+00],
        [0.00000000e+00, 2.01400000e+03, 1.00000000e+00, ...,
         6.30670772e+10, 2.30900000e+03, 1.00000000e+00]]),

Now that we’ve converted our inputs and targets to numpy arrays, it’s time to convert them to PyTorch tensors.

In [19]:
inputs = torch.from_numpy(inputs_array).type(torch.float32)
targets = torch.from_numpy(targets_array).type(torch.float32)

Let’s check the types to make sure that they’re of type float32.

In [20]:
inputs.dtype, targets.dtype
(torch.float32, torch.float32)

Next, we’ll create our dataset and dataloaders that are needed for training and testing.

In [21]:
dataset = TensorDataset(inputs, targets)

Here, we tweak the percentage of the data owned by each of the training, validation, and testing sets. Then, we randomly split these datasets so that the order of the original data does not have any impact on the results.

In [22]:
val_percent = 0.15
test_percent = 0.15
val_size = int(num_rows * val_percent)
test_size = int(num_rows * test_percent)
train_size = num_rows - val_size - test_size

train_ds, val_ds, test_ds = random_split(dataset, [train_size, val_size, test_size]) # randomly splitting the dataset into train, validation, and test datasets

Batch size is an important hyper parameter of the network. The larger the batch size is, the closer our mini-batch gradient descent results become to those of full batch gradient descent on the entire data. The reason we operate on batches is to improve speed, but in return we sacrifice a bit of accuracy in our model for that performance gain. Hence, the larger our batch size is, the more accurate, yet slower learning, our network becomes.

In [23]:
batch_size = 32

In this step, the dataloaders are created. It is important to understand why the training loader is shuffled while the validation and test loaders aren’t. We would not like our model to learn based on the order of the data being fed to it, because that would bias the model’s understanding of the data in a very specific direction, which is clearly something we don’t want. But why don’t we shuffle the other loaders? Because when validating and testing, we really do not care about the order of the elements we evaluate our model against; all that matters is how our model performs, which doesn’t change with the order of the data.

In [24]:
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
test_loader = DataLoader(test_ds, batch_size)

Let’s take a quick look at what one of our batches looks like.

In [25]:
for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break # only show the first batch
inputs: tensor([[0.0000e+00, 2.0000e+03, 1.0000e+00, 0.0000e+00, 2.3810e+05, 8.1900e-01,
         9.5834e+10, 3.1446e+04, 0.0000e+00],
        [0.0000e+00, 2.0000e+03, 1.0000e+00, 0.0000e+00, 1.1530e+06, 7.6900e-01,
         4.7311e+10, 4.9660e+03, 0.0000e+00],
        [0.0000e+00, 1.9860e+03, 1.0000e+00, 0.0000e+00, 2.2784e+06, 7.7660e-01,
         3.7744e+11, 1.6062e+04, 1.0000e+00],
        [0.0000e+00, 1.9930e+03, 1.0000e+00, 1.0000e+00, 7.1380e+05, 7.7660e-01,
         1.6281e+10, 1.6890e+03, 0.0000e+00],
        [0.0000e+00, 2.0040e+03, 1.0000e+00, 1.0000e+00, 4.2841e+05, 7.7660e-01,
         3.9416e+11, 5.6123e+04, 0.0000e+00],
        [0.0000e+00, 1.9980e+03, 1.0000e+00, 0.0000e+00, 1.7762e+05, 7.7660e-01,
         1.3617e+10, 4.0460e+03, 0.0000e+00],
        [0.0000e+00, 2.0010e+03, 1.0000e+00, 0.0000e+00, 2.6186e+06, 7.7660e-01,
         7.3638e+11, 2.5165e+04, 0.0000e+00],
        [0.0000e+00, 1.9950e+03, 0.0000e+00, 0.0000e+00, 3.4223e+05, 5.1300e-01,
         1.4655e+10, 1.6960e+03, 0.0000e+00],
        [0.0000e+00, 2.0040e+03, 0.0000e+00, 0.0000e+00, 4.7634e+05, 7.7660e-01,
         3.0090e+11, 3.8711e+04, 0.0000e+00],
        [0.0000e+00, 1.9920e+03, 1.0000e+00, 1.0000e+00, 2.2000e+05, 7.7660e-01,
         5.2156e+10, 2.0029e+04, 0.0000e+00],
        [0.0000e+00, 2.0120e+03, 1.0000e+00, 0.0000e+00, 2.9331e+04, 8.3000e-01,
         9.2096e+09, 2.3074e+04, 0.0000e+00],
        [0.0000e+00, 1.9930e+03, 0.0000e+00, 0.0000e+00, 1.0250e+05, 7.7660e-01,
         3.2634e+09, 3.4050e+03, 1.0000e+00],
        [0.0000e+00, 1.9890e+03, 1.0000e+00, 0.0000e+00, 5.0200e+04, 7.7660e-01,
         1.0392e+10, 2.9239e+04, 0.0000e+00],
        [0.0000e+00, 1.9890e+03, 0.0000e+00, 0.0000e+00, 8.4000e+04, 7.7660e-01,
         1.1906e+11, 2.5582e+04, 0.0000e+00],
        [0.0000e+00, 1.9910e+03, 0.0000e+00, 1.0000e+00, 6.4010e+03, 7.7660e-01,
         4.8171e+08, 7.9760e+03, 0.0000e+00],
        [0.0000e+00, 2.0000e+03, 0.0000e+00, 0.0000e+00, 6.4455e+05, 9.1700e-01,
         1.7132e+11, 4.1099e+04, 1.0000e+00],
        [0.0000e+00, 2.0110e+03, 1.0000e+00, 0.0000e+00, 6.4550e+03, 7.2000e-01,
         3.7745e+09, 4.8620e+03, 0.0000e+00],
        [0.0000e+00, 2.0000e+03, 0.0000e+00, 0.0000e+00, 1.1289e+04, 7.7660e-01,
         6.7254e+07, 9.2800e+02, 0.0000e+00],
        [0.0000e+00, 1.9970e+03, 0.0000e+00, 0.0000e+00, 8.9560e+03, 7.7660e-01,
         3.4777e+08, 3.6130e+03, 0.0000e+00],
        [0.0000e+00, 1.9920e+03, 0.0000e+00, 0.0000e+00, 1.8590e+05, 7.7660e-01,
         2.3166e+09, 6.0300e+02, 0.0000e+00],
        [0.0000e+00, 1.9880e+03, 0.0000e+00, 0.0000e+00, 7.5520e+05, 7.7660e-01,
         3.7514e+11, 1.0250e+04, 0.0000e+00],
        [0.0000e+00, 2.0100e+03, 0.0000e+00, 0.0000e+00, 1.1420e+05, 8.0700e-01,
         5.9830e+10, 1.4232e+04, 0.0000e+00],
        [0.0000e+00, 2.0050e+03, 0.0000e+00, 0.0000e+00, 1.2581e+04, 6.9100e-01,
         9.5121e+08, 6.2920e+03, 0.0000e+00],
        [0.0000e+00, 2.0130e+03, 0.0000e+00, 0.0000e+00, 2.6916e+05, 8.2100e-01,
         3.2540e+10, 2.6793e+04, 0.0000e+00],
        [0.0000e+00, 2.0120e+03, 0.0000e+00, 0.0000e+00, 4.1399e+05, 7.8800e-01,
         5.1264e+10, 1.6264e+04, 0.0000e+00],
        [0.0000e+00, 1.9910e+03, 1.0000e+00, 0.0000e+00, 6.8483e+05, 7.7660e-01,
         1.0514e+11, 1.0816e+04, 0.0000e+00],
        [0.0000e+00, 1.9880e+03, 1.0000e+00, 1.0000e+00, 8.9590e+06, 7.7660e-01,
         3.0717e+12, 2.6687e+04, 0.0000e+00],
        [0.0000e+00, 2.0010e+03, 1.0000e+00, 0.0000e+00, 5.6936e+05, 7.7660e-01,
         2.3992e+11, 2.8429e+04, 0.0000e+00],
        [0.0000e+00, 2.0130e+03, 1.0000e+00, 0.0000e+00, 1.1251e+05, 8.2100e-01,
         3.2540e+10, 2.6793e+04, 0.0000e+00],
        [0.0000e+00, 2.0020e+03, 1.0000e+00, 1.0000e+00, 3.1906e+05, 7.7660e-01,
         1.3955e+11, 2.8390e+04, 0.0000e+00],
        [0.0000e+00, 2.0000e+03, 0.0000e+00, 0.0000e+00, 7.8880e+03, 7.7660e-01,
         6.7254e+07, 9.2800e+02, 1.0000e+00],
        [0.0000e+00, 1.9940e+03, 1.0000e+00, 0.0000e+00, 3.2629e+06, 7.7660e-01,
         2.5744e+11, 8.3280e+03, 0.0000e+00]])
targets: tensor([[ 0.4200],
        [ 7.5500],
        [ 5.1800],
        [ 4.9000],
        [ 0.5600],
        [ 6.1500],
        [ 3.8000],
        [ 0.4200],
        [ 4.0900],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000],
        [ 7.9500],
        [ 0.7400],
        [ 0.0000],
        [ 6.5100],
        [ 0.5300],
        [ 0.0000],
        [ 7.2100],
        [ 0.0000],
        [ 0.3400]])
In [39]:
jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Detected Kaggle notebook...
[jovian] Uploading notebook to

Step 4: Creating a Linear Regression Model

Input size and output size are quite boring in that they’re already predetermined by your dataset and by the features and labels you choose to use. What’s really interesting is the size of the hidden layer (i.e. how many nodes the neural network has at a specific hidden layer). There are a lot of theories on how to choose the right size for a hidden layer, and more importantly for the network as a whole, but at the end of the day, most of this comes down to systematic experimentation and seeing what works best. A common rule of thumb is to have the hidden size fall anywhere between the input and output sizes, but sometimes that isn’t the best choice for the model. That being said, I’ve experimented with this and found that setting the hidden size to the average of the input and output sizes worked well enough for this model.

In [27]:
input_size = len(input_cols)
output_size = len(output_cols)
hidden_size = (input_size + output_size) // 2
input_size, hidden_size, output_size
(9, 5, 1)

In this model, I create one hidden layer between the input and output layers. I also introduce the ReLU function (f(x) = max(0, x)), which introduces nonlinearity to the model. Without the ReLU function, this model could at best capture a linear trend in the data, which isn’t always present. As for the loss function, I’ve chosen the L1 loss over the mean squared error (MSE) loss mainly because L1 loss performs much better when there are a lot of outliers in the data. When outliers are present, MSE loss tends to grow really fast due to the rapid growth of the parabolic curve it is built upon.

In [28]:
class SuicideRateModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size) # hidden layer
        self.linear2 = nn.Linear(hidden_size, output_size) # output layer
    def forward(self, xb):
        out = self.linear1(xb)
        out = F.relu(out)
        out = self.linear2(out)
        return out
    def training_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        loss = F.l1_loss(out, targets)
        return loss
    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        loss = F.l1_loss(out, targets)
        return {'val_loss': loss.detach()}
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()
        return {'val_loss': epoch_loss.item()}
    def epoch_end(self, epoch, result, num_epochs):
        if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))

Let’s now create a model using the SuicideRateModel class.

In [29]:
model = SuicideRateModel()

We can immediately see that PyTorch has, by default, initialized our model's parameters (the weights and biases) to random values. These will be updated during training as our model learns more about the data.

In [30]:
list(model.parameters())
[Parameter containing:
 tensor([[ 0.2607, -0.2693, -0.2301, -0.1531,  0.2703, -0.3152, -0.3206,  0.1158,
         [-0.2637,  0.0913, -0.0701, -0.0588,  0.0740, -0.1077,  0.2253, -0.3285,
         [-0.1985, -0.1436,  0.2089, -0.2217, -0.1528,  0.2176,  0.2821, -0.1974,
         [-0.3266, -0.1124,  0.2101, -0.2125, -0.2752,  0.0953,  0.1409, -0.2400,
         [ 0.2929, -0.1546, -0.1049,  0.1768,  0.0744, -0.0672, -0.2633, -0.2822,
          -0.2366]], requires_grad=True),
 Parameter containing:
 tensor([-0.0661, -0.1228, -0.0380, -0.1720,  0.3211], requires_grad=True),
 Parameter containing:
 tensor([[0.0549, 0.2363, 0.1551, 0.2776, 0.0373]], requires_grad=True),
 Parameter containing:
 tensor([0.4126], requires_grad=True)]
In [31]:
jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Detected Kaggle notebook...
[jovian] Uploading notebook to

Step 5: Training the Model to Fit the Data

We’ll be using the fit function to train the model and the evaluate function to check how well our model is performing at each step.

In [32]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)
def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    training_history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()       # compute gradients
            optimizer.step()      # update the parameters
            optimizer.zero_grad() # reset gradients for the next batch
        training_history.append(evaluate(model, train_loader))
        result = evaluate(model, val_loader)
        history.append(result)
        model.epoch_end(epoch, result, epochs)
    return history, training_history

Let’s quickly use the evaluate function to test our model with the randomly initialized parameters against the validation set before performing any training steps.

In [33]:
result = evaluate(model, val_loader)
result
{'val_loss': 59966304256.0}

We're now ready to train our model. This process requires a significant amount of experimentation in terms of tuning the learning rate and the number of epochs. There is some quite deep mathematical theory behind the values that get chosen, and it requires an understanding of gradient descent and calculus. Essentially, the simplified idea is that our goal is to minimize our model's loss function, which is differentiable with respect to our model's parameters (i.e. the weights and biases).

Let's imagine a random loss value output from our loss function. If the slope at that point is positive, then the loss function is increasing relative to that position. That means that to decrease the loss, we need to decrease our parameters by subtracting a small proportion of the positive slope value from them. Similarly, if the slope at that point is negative, then the loss function is decreasing relative to that position, which means that to decrease the loss further, we need to increase our parameters by subtracting (yes, subtracting!) a small proportion of the negative slope value from them. This small proportion is what's known as the learning rate.

Having a large learning rate allows the model to learn really fast, whereas a smaller learning rate forces the model to learn slowly. However, each option comes with its own drawbacks. High learning rates can cause the model to overshoot and miss the global minimum, whereas small learning rates can cause the model to undershoot and get stuck at a local minimum that may not be optimal. Hence, the most logical strategy is usually to start with a high learning rate, so that the model approaches the global minimum quickly at first, and then lower the learning rate as it gets closer so as not to overshoot.
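To make that update rule concrete, here's a minimal sketch in plain Python (the one-dimensional loss and all numbers are made up for illustration; the real model's loss lives in a much higher-dimensional parameter space):

```python
# Toy 1-D gradient descent on loss(w) = (w - 3)^2, whose minimum sits at w = 3.

def slope(w):
    return 2 * (w - 3)  # derivative of (w - 3)^2 with respect to w

w = 0.0   # "randomly" initialized parameter
lr = 0.1  # learning rate: the small proportion of the slope we subtract
for _ in range(50):
    # negative slope => w increases; positive slope => w decreases
    w -= lr * slope(w)

print(w)  # converges toward 3
```

Note that the same subtraction handles both cases: subtracting a negative slope increases `w`, exactly as described above.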

In [34]:
epochs = 5
lr = 1e-1
history1, training_history = fit(epochs, lr, model, train_loader, val_loader)
val_losses = []
training_losses = []

for obj in history1:
    val_losses.append(obj['val_loss'])
for obj in training_history:
    training_losses.append(obj['val_loss'])

plt.plot(val_losses, label='validation loss')
plt.plot(training_losses, label='training loss')
plt.xlabel('Number of Epochs')
plt.legend()
Epoch [5], val_loss: 10.8362

From this graph, we can see that both our validation and training losses have dropped quite drastically in just 5 epochs. Let’s see if we can decrease them further with a smaller learning rate.

In [35]:
epochs = 2
lr = 1e-6
history2, training_history = fit(epochs, lr, model, train_loader, val_loader)
val_losses = []
training_losses = []

for obj in history2:
    val_losses.append(obj['val_loss'])
for obj in training_history:
    training_losses.append(obj['val_loss'])

plt.plot(val_losses, label='validation loss')
plt.plot(training_losses, label='training loss')
plt.xlabel('Number of Epochs')
plt.legend()
Epoch [2], val_loss: 10.8362

It’s quite apparent from the graph above that the losses have stalled and approached a flat curve. This essentially lets us know that our model has converged to what it thinks is the optimum solution. This can sometimes be inaccurate depending on the choices made for all the hyperparameters discussed earlier, but for this model, I’ve tested with multiple hyperparameters and the loss flattens out at this point regardless, so this stall in the loss is probably caused by the fact that our dataset is relatively small.

Step 6: Evaluating the Model on the Test Data

Let's go right ahead and test our model on the testing data to make sure that our testing loss is close enough to our training loss. If our testing loss is much larger than our training loss, that means our model has either overfit the training data (learning so many specific details about the training data that it cannot generalize to new, unseen data) or underfit the training data (not learning enough about the data to be able to make good predictions). If our testing loss is much lower than our training loss, then either we made some mistake in our model construction or our testing data just contains much easier examples than our training data.

But here's the catch! If it's the latter case and the shuffling of the training data isn't based on a fixed random seed (i.e. the training data order differs each time we run), then running multiple times should not consistently produce the same loss comparison between the training and testing data. So if you do run multiple times and each time the testing loss comes out much lower than the training loss, then there is a high likelihood that there's a logical error somewhere in the model construction.

In [36]:
result = evaluate(model, test_loader)
result
{'val_loss': 10.995272636413574}

Here, we can see that our testing loss is very close to our training loss, and thus we have neither underfit nor overfit the training data. It looks like our model has performed just as we expected based on the output of the training steps above.

In [49]:
jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Detected Kaggle notebook...
[jovian] Uploading notebook to

Artificial Intelligence · Deep Learning · Guides · Linear Algebra · Machine Learning · Mathematics · Programming Languages · Python · PyTorch

Exploring 5 PyTorch Functions

Derived from here

Hi everyone! I recently discovered a free, live-streamed 6-week PyTorch deep learning course on YouTube and decided to commit to it in the spare time that I have. The following is a copy of the Jupyter Notebook instance I wrote for the first assignment. This notebook can also be found on GitHub here. I hope you find this resource useful!

Exploring PyTorch Tensor Functions

PyTorch is an open source machine learning library that allows building deep learning projects at high speed and with great flexibility. PyTorch also serves as an advancement over NumPy through its incorporation of GPU power. For the remainder of this document, we will be taking a look at the following PyTorch tensor functions:

  • torch.numel
  • torch.logspace
  • torch.full
  • torch.cat
  • torch.narrow
In [1]:
# Import torch and other required modules
import torch

Function 1 – torch.numel

The torch.numel function takes a tensor as an input parameter and returns a count of the elements contained by that tensor at all dimensions.

In [2]:
# Example 1 - working
x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
torch.numel(x)
9

In this example, we have a square matrix tensor of dimensions $3 \times 3$, meaning that there are $3 \times 3 = 9$ elements contained in total.

In [3]:
# Example 2 - working
x = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
torch.numel(x)
12

Here, we’re no longer using a square matrix tensor. Now our tensor has dimensions $3 \times 2 \times 2$ (i.e. $3$ rows, each split into $2$ columns, each of which has space for $2$ values). It might help to visualize this tensor as a rectangular prism of length $3$, width $2$, and height $2$. That being said, the number of elements contained in total is the volume of the prism itself, which is $3 \times 2 \times 2 = 12$.

In [4]:
# Example 3 - breaking (to illustrate when it breaks)
t = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
torch.numel(t)
TypeError                                 Traceback (most recent call last)
      1 # Example 3 - breaking (to illustrate when it breaks)
      2 t = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
----> 3 torch.numel(t)

TypeError: numel(): argument 'input' (position 1) must be Tensor, not list

The failure here is caused by the fact that the torch.numel function only accepts a tensor as input. Although the list t constructed in this example is formatted in the style of a tensor, torch.numel still cannot accept it, because it has no guarantee beforehand that the list given to it is in tensor format rather than some incompatible format (e.g. a list of strings). Hence, this function demands that the input data be encapsulated in a tensor object to guarantee that the input it receives is actually a number, a numeric vector, or a numeric matrix.

The function torch.numel is used in a variety of cases. It can be used to compute the average (mean) value across a tensor or even to compute the size of a tensor that was loaded from an external source such as a file or a scraped site. Any time the number of elements contained by a tensor is needed, this function is the perfect tool for the job.

Function 2 – torch.logspace

The torch.logspace function outputs a 1-dimensional tensor containing logarithmically spaced values starting at base$^{start}$ and ending at base$^{end}$, where (base, start, end) are parameters set by the user. The number of points outputted in this tensor depends on the specified steps parameter value.

In [ ]:
# Example 1 - working
torch.logspace(start=0, end=5, steps=6, base=2)

This is a tensor containing values ranging from $2^{0} = 1$ to $2^{5} = 32$ inclusive. But how exactly are the intermediate values determined? The exponents are spaced linearly between start and end, so consecutive exponents are $\cfrac{end - start}{steps - 1} = \cfrac{5 - 0}{6 - 1} = 1$ unit apart. Given this information as well as the base value of $2$, we arrive at our tensor values: $2^{0}, 2^{1}, 2^{2}, 2^{3}, 2^{4}, 2^{5}$, which are equivalent to the output shown above.

In [ ]:
# Example 2 - working
torch.logspace(start=0, end=5, steps=5, base=2, dtype=torch.int32)

The resulting tensor this time around again contains values ranging from $2^{0} = 1$ to $2^{5} = 32$ inclusive. However, the intermediate values are arrived at differently, not only because the steps parameter has been modified, but also because we've now specified an output data type of $32$-bit integers, which means that floating point results will be truncated. With steps $= 5$, consecutive exponents are $\cfrac{end - start}{steps - 1} = \cfrac{5 - 0}{5 - 1} = 1.25$ apart, giving exponents $0, 1.25, 2.5, 3.75, 5$. Given this information, the base value of $2$, and the truncation to integers, we arrive at our tensor values: $\lfloor 2^{0} \rfloor, \lfloor 2^{1.25} \rfloor, \lfloor 2^{2.5} \rfloor, \lfloor 2^{3.75} \rfloor, \lfloor 2^{5} \rfloor = 1, 2, 5, 13, 32$, which are equivalent to the output shown above.
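As a sanity check on the spacing rule above, here's a small pure-Python sketch (no PyTorch needed) that mimics how the exponents appear to be chosen — linearly spaced from start to end, then raised over the base; the helper name is my own:

```python
# Sketch of torch.logspace's value selection (assuming linspace-style exponents).
def logspace(start, end, steps, base=2.0):
    if steps == 1:
        return [base ** start]
    spacing = (end - start) / (steps - 1)  # distance between consecutive exponents
    return [base ** (start + i * spacing) for i in range(steps)]

print(logspace(0, 5, 6, base=2))                    # exponents 0..5, spacing 1
print([int(v) for v in logspace(0, 5, 5, base=2)])  # spacing 1.25, truncated to ints
```

With steps=6 the spacing is exactly 1 and the values are the clean powers of two; with steps=5 the spacing becomes 1.25 and the int cast truncates the intermediate values.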

In [5]:
# Example 3 - breaking (to illustrate when it breaks)
x = torch.randn(4, dtype=torch.float32)
torch.logspace(start=1, end=4, steps=4, base=2, out=x, dtype=torch.int32)
RuntimeError                              Traceback (most recent call last)
      1 # Example 3 - breaking (to illustrate when it breaks)
      2 x = torch.randn(4, dtype=torch.float32)
----> 3 torch.logspace(start=1, end=4, steps=4, base=2, out=x, dtype=torch.int32)

RuntimeError: dtype Int does not match dtype of out parameter (Float)

We get an error here because of mismatching types. When calling the torch.logspace function, we specified that the output tensor should be stored in the tensor x, which is completely fine to do. However, we also specified that the output tensor's type should be $32$-bit integers, which conflicts with the passed-in output tensor x's type of $32$-bit floating point values.

The torch.logspace function is often used to create frequency vectors containing values within a specified range. For example, this proves to be beneficial when testing out multiple learning rate values to see which leads to better optimization of a machine learning algorithm. Before using this function, one needs to understand when exactly they need their data logarithmically spaced out because sometimes, it might be better to use a linear spacing instead. So when would you want to use one data spacing type over the other? Simply put, if you’re modeling something that relies on some internal relative change ($\textit{multiplicative}$) mechanism, then a logarithmic spacing would allow you to capture the patterns in this mechanism more accurately than a linear spacing would. Similarly, if you’re modeling something that relies on some internal absolute change ($\textit{additive}$) mechanism, then a linear spacing would allow you to capture the patterns in this mechanism more accurately than a logarithmic spacing would.

Function 3 – torch.full

The torch.full function creates a tensor of specified size, prefills it entirely with a value specified by the value parameter, and then returns that tensor as output.

In [6]:
# Example 1 - working
torch.full((2, 4), 3.0)
tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.]])

In this example, we can see that the function returned a matrix tensor of dimensions $2 \times 4$ ($2$ rows by $4$ columns) with all entry values set to $3.0$.

In [7]:
# Example 2 - working
torch.full((1, 1), -10.0)
tensor([[-10.]])

Here, we have a simple example of a $1 \times 1$ matrix ($1$ row by $1$ column) that has its one and only entry set to a value of $-10.0$.

In [8]:
# Example 3 - breaking (to illustrate when it breaks)
x = list()
torch.full((2, 4), 3.7, out=x)
TypeError                                 Traceback (most recent call last)
      1 # Example 3 - breaking (to illustrate when it breaks)
      2 x = list()
----> 3 torch.full((2, 4), 3.7, out=x)

TypeError: full() received an invalid combination of arguments - got (tuple, float, out=list), but expected one of:
 * (tuple of ints size, Number fill_value, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, Number fill_value, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

The out parameter in the torch.full function is used to determine where the output tensor is to be stored. However, by providing a list as the output type instead of a tensor, this results in a TypeError since PyTorch expects to be passed a tensor as the out parameter and not a list.

The torch.full function is used whenever one needs to create a tensor loaded with the same specified value at every position. For example, this can be used for initializing a weight tensor with an initial value that's the same for each weight. There are also certain linear algebraic operations that require creating a vector tensor containing the same value at each position and then concatenating it to another tensor to be able to compute a specific output correctly.

Function 4 – torch.cat

The torch.cat function takes an input tuple of tensors as well as a dimension dim, and returns an output tensor that is a concatenation of the input tensors along the specified dimension.

In [9]:
# Example 1 - working
x = torch.tensor([
    [3, 4],
    [5, 6],
    [7, 8]
])
torch.cat((x, x), dim=0)
tensor([[3, 4],
        [5, 6],
        [7, 8],
        [3, 4],
        [5, 6],
        [7, 8]])

The tensor x in this example has two dimensions. The first dimension (dim $= 0$) encompasses all the elements wrapped by the outer array (i.e. the $3$ inner arrays). Similarly, the second dimension (dim $= 1$) encompasses all the values contained by each of the $3$ inner arrays that are wrapped by the outer array. Hence, when we concatenate x with itself at dim $= 0$, we’re essentially duplicating the $3$ inner arrays since those are x's contents at dim $= 0$.

In [10]:
# Example 2 - working
x = torch.tensor([
    [3, 4],
    [5, 6],
    [7, 8]
])
torch.cat((x, x), dim=1)
tensor([[3, 4, 3, 4],
        [5, 6, 5, 6],
        [7, 8, 7, 8]])

As mentioned above, the tensor x in this example has two dimensions. That being said, when we concatenate x with itself at dim $= 1$, we’re essentially duplicating the values contained by each of the 3 inner arrays (i.e. $[3,4] \implies[3, 4, 3, 4]$) since those are x's contents at dim $= 1$.

In [11]:
# Example 3 - breaking (to illustrate when it breaks)
x = torch.tensor([
    [3, 4],
    [5, 6],
    [7, 8]
])
y = torch.tensor([
    [3, 4, 10],
    [5, 6, 11],
    [7, 8, 13]
])
torch.cat((x, y), dim=0)
RuntimeError                              Traceback (most recent call last)
     10     [7, 8, 13]
     11 ])
---> 12 torch.cat((x, y), dim=0)

RuntimeError: Sizes of tensors must match except in dimension 0. Got 2 and 3 in dimension 1

The torch.cat function fails here because it expects all input tensors to have the same size in every dimension $\textbf{except}$ the one along which the concatenation occurs. In this example, we specified the concatenation at dim $= 0$, and thus need to make sure that x and y have the same size at dim $= 1$, which clearly isn't the case ($2 \neq 3$) and is the reason behind the error above.

The torch.cat function can be used whenever a sequence of tensors needs to be joined for a computation to proceed. An example use case is in the construction of recurrent neural networks (RNNs), which use torch.cat to continuously join hidden input and output states. Another example use case is in data parallelism, which involves breaking input data into minibatches and operating on those minibatches in parallel to improve performance.

Function 5 – torch.narrow

The torch.narrow function takes in an input tensor, a dimension dim, a starting position start, and a distance length to move from the starting position. The output tensor contains only the elements at the specified dimension dim that are within the index range $\lbrack$start, start + length$)$, hence the term narrowing.

In [12]:
# Example 1 - working
x = torch.tensor([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])
torch.narrow(x, dim=0, start=1, length=2)
tensor([[ 5,  6,  7,  8],
        [ 9, 10, 11, 12]])

At dim $=0$, x has $3$ inner vectors. Each of these vectors is at an index in dim $= 0$ of x. Hence, we have indices $0$, $1$, and $2$. For this example, we’ve specified that we would like to start at index $1$ and only include $2$ items from dim $= 0$ from there onward. This means that the torch.narrow function will only output the inner vectors at indices $1$ and $2$.

In [13]:
# Example 2 - working
x = torch.tensor([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15]
])
torch.narrow(x, dim=1, start=1, length=3)
tensor([[ 2,  3,  4],
        [ 7,  8,  9],
        [12, 13, 14]])

At dim $= 1$, we're operating on each of the three inner vectors contained in dim $= 0$ of x. To narrow each of these vectors down, we select only the $3$ values in the index range $\lbrack 1, 4)$ (or equivalently the second, third, and fourth values) from each vector.

In [15]:
# Example 3 - breaking (to illustrate when it breaks)
x = torch.tensor([
    [1, 2],
    [3, 4]
])
torch.narrow(x, dim=0, start=1, length = 2)
RuntimeError                              Traceback (most recent call last)
      4     [3, 4]
      5 ])
----> 6 torch.narrow(x, dim=0, start=1, length = 2)

RuntimeError: start (1) + length (2) exceeds dimension size (2).

This fails because there are only two vectors at dim $= 0$ of x, and we are asking the torch.narrow function to return two vectors starting at index $1$ (i.e. the second vector of x). That would require a vector after $[3, 4]$, but there isn't one, and thus an error is thrown.

The torch.narrow function comes in very handy when we need to perform a computation that only requires a specific chunk from some tensor. One of the especially important features of the torch.narrow function is that it allows us to extract a tensor chunk from a larger tensor without making a memory copy of the original tensor. That is, the new tensor that is returned references the same storage point as the original tensor that is being narrowed. Without this function, we’d have to make deep copies of tensor data every time we wanted to reuse that data, which is very inefficient in terms of memory management. Hence, the torch.narrow function solves this exact problem.
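This zero-copy behavior can be illustrated without PyTorch at all. As a rough analogy (not PyTorch's actual mechanism), Python's memoryview likewise "narrows" a buffer into a view over the same underlying storage rather than a copy:

```python
# Rough analogy only: memoryview slices share storage with the original buffer,
# much like the tensor returned by torch.narrow shares storage with its input.
buf = bytearray(b"abcdefgh")
view = memoryview(buf)[2:5]  # "narrow" to indices [2, 5) without copying

buf[2] = ord("X")            # mutate the original storage...
print(bytes(view))           # ...and the view sees the change: b'Xde'
```

Because the view and the buffer share storage, a write through either is visible through the other — the same aliasing you should keep in mind when mutating a narrowed tensor.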


In this notebook, we've explored a small sample of PyTorch's vast collection of functions. Although I only shed light on $5$ functions in this document, I gained a rich amount of knowledge researching their applications, which in turn pushed me to learn about other functions. If you would like to learn more about these and other PyTorch functions, please check out the official documentation.


In [20]:
!pip install jovian --upgrade --quiet
In [21]:
import jovian
In [22]:
[jovian] Attempting to save notebook..
[jovian] Updating notebook "khalilhijazi/01-tensor-operations" on
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully!

Javascript · Linear Algebra · Massey Ranking · Mathematics · Programming Languages · Python · Side Projects · Sports · Web Scraping

NFL Team Ranking Approximation From 1970 – Current

Derived from here

Yesterday was the last day of my semester and a great day to put one of my favorite models from linear algebra, the Massey Method, to good use. As I revisited my linear algebra book to look back at all that we'd learned throughout the semester, the idea of finding out how NFL teams have ranked from day one till now caught my interest. That being said, I committed to finding an approximation using the Massey Method, and here I will share what I've found.

Finding the Data

I first checked the NFL's site to see how far schedules went back and noticed they ran from 1970 till now. I then searched Google for an API or dataset that would give me the scores of each game, but unfortunately, I wasn't able to find anything. Well, I lied. I found some resources, but they weren't exactly what I was looking for. So I ended up writing a Python script (source code here) that crawled the week page of each season and scraped the scores and teams for all games.

How the Massey Method Works

The Massey Method works based on the amount of games played between teams and point differential across games. The ranking for each team is found using least squares approximation. Here’s the general algorithm for the Massey Method:

    1. Write down an equation for every single match played. An equation that relates two teams r1 and r2 with a point differential d can be expressed as r1 – r2 = d.
    2. Convert that into a system of equations of the form Ar̄ = p̄.
    3. Reach the least squares system of the form AᵀAr̂ = Aᵀp̄.
    4. Change the left matrix of this system by setting the last row to a row consisting solely of the number 1. Also change the bottom entry of the new right vector to a 0.
    5. Now solve the system to determine the ranks.
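The steps above can be sketched end to end in pure Python on a tiny hypothetical schedule (three made-up teams and three games; the team names, scores, and the little Gaussian elimination helper are all illustrative, not the actual script):

```python
# Minimal Massey-method sketch on a made-up 3-team season.
# Games recorded as (winner, loser, point_differential).
games = [("A", "B", 3), ("B", "C", 2), ("A", "C", 4)]
teams = ["A", "B", "C"]
idx = {t: i for i, t in enumerate(teams)}
n = len(teams)

# Build the Massey system M = A^T A and p = A^T d directly:
# M[i][i] = games played by team i, M[i][j] = -(games between i and j).
M = [[0.0] * n for _ in range(n)]
p = [0.0] * n
for w, l, d in games:
    i, j = idx[w], idx[l]
    M[i][i] += 1; M[j][j] += 1
    M[i][j] -= 1; M[j][i] -= 1
    p[i] += d; p[j] -= d

# Replace the last row with all ones (and 0 on the right) so the system
# has a unique solution, with ratings summing to zero.
M[-1] = [1.0] * n
p[-1] = 0.0

# Solve M r = p by Gaussian elimination with partial pivoting.
for col in range(n):
    piv = max(range(col, n), key=lambda r: abs(M[r][col]))
    M[col], M[piv] = M[piv], M[col]
    p[col], p[piv] = p[piv], p[col]
    for r in range(col + 1, n):
        f = M[r][col] / M[col][col]
        for c in range(col, n):
            M[r][c] -= f * M[col][c]
        p[r] -= f * p[col]
ratings = [0.0] * n
for r in range(n - 1, -1, -1):
    s = sum(M[r][c] * ratings[c] for c in range(r + 1, n))
    ratings[r] = (p[r] - s) / M[r][r]

print({t: round(ratings[idx[t]], 3) for t in teams})
```

Team A, which won both of its games by large margins, comes out on top, and the ratings sum to zero thanks to the replaced last row.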

Final Results

In total, over the span of around 49 years of NFL history, 11,652 games were played across 32 teams, counting only regular season and playoff games. Preseason games are skewed and not worth adding to this data, as coaches use that time mainly to test out rookies and evaluate their team as a whole unit. That being said, here are the results my script returned:

Team Rank
Pittsburgh Steelers 3.707942162671477
Dallas Cowboys 2.9660468156960293
Baltimore Ravens 2.848222524156146
New England Patriots 2.666908177632458
Denver Broncos 2.3116494180207723
San Francisco 49ers 2.190527900025692
Miami Dolphins 2.0144090483449753
Minnesota Vikings 1.3172208931756877
Green Bay Packers 1.175465583138116
Washington Redskins 1.0590513996484625
Philadelphia Eagles 0.9921052597864558
Kansas City Chiefs 0.8187838466188585
Seattle Seahawks 0.7669103717005967
Los Angeles Chargers 0.47777279702627246
Oakland Raiders 0.37408321617291307
Los Angeles Rams -0.2173506542743492
New York Giants -0.2642147178334522
Carolina Panthers -0.3937363102811801
Chicago Bears -0.6790915673325525
Cincinnati Bengals -0.7162153192730953
Indianapolis Colts -0.773696735004739
Buffalo Bills -1.1664086499973125
Jacksonville Jaguars -1.367965684118028
New York Jets -1.3913297554590114
New Orleans Saints -1.4872441285045244
Tennessee Titans -1.65339271127078
Atlanta Falcons -1.9260637829642524
Detroit Lions -2.0466076262955615
Houston Texans -2.281760321146092
Cleveland Browns -2.6836219194417277
Arizona Cardinals -2.979684378641172
Tampa Bay Buccaneers -3.6587151519770678

Analysis of Results

As you may have already noticed, the numbers in the table above are sorted in decreasing order; the highest rank is at the top and the lowest at the bottom. However, what does that really mean in terms of the question we asked at the start: which teams tend to win by the greatest point differential? From the data above, the higher a team ranks, the greater its point differential tends to be. What that means, for example, is that the Pittsburgh Steelers tend to win games by a greater point differential than the Dallas Cowboys, the Dallas Cowboys by a greater differential than the Baltimore Ravens, and so on. It is important to note that a higher point differential ranking does not necessarily mean that one team is better than another.


There are many ways this data could've been used to extract meaningful information, but I found this approach particularly interesting for two reasons: (1) I'm very fond of the Massey Method, specifically its least squares approximation aspect, and (2) I wanted to see which teams tend to outscore their opponents on average. At the very least, this project was a great way to test a data approximation model on real data, and that is something I value. I hope you found this a helpful, or at least interesting, read. As usual, feel free to express your opinions and concerns below and I'd be glad to respond.

Guides · Programming Languages · Side Projects

Starting Projects the Right Way

Derived from here

It's been a year now since I started university, and in that timespan I have created many projects, some of which turned out successful and others with the exact opposite result. What that did teach me, however, is that there are certain common factors that sort of hint at whether a project is likely to succeed or not. Through this year of experience, as well as a fair share of research on my end, I learned a lot about establishing outstanding side projects. I am by no means an expert in this matter, but I do have a thing or two up my sleeve, so stay with me!

Rule # 1: Plan Everything Out (or at least the basics)

“By failing to prepare, you are preparing to fail.” – Benjamin Franklin

To make a great project, you want to invest time in laying out a general piece of how the UI will look, what functionality you’ll need, what libraries and modules you’ll be using, how you’re planning to set up the backend, and so on. You need to understand the problem you’re solving, the audience you’re addressing, and the platform you’re targeting. A common problem I see is that people often decide on a solution, algorithm, language, or scheme before really grasping what the problem is truly about. Dig deep into the matter and understand what, how, and why you’re pursuing this project. Not only does having a plan reduce proneness to failure, but it also allows you to set reasonable goals (daily, weekly, monthly, yearly) that will only increase your productivity and drive to finish that project.

Rule # 2: Understand How the Tools You’ve Chosen Actually Work

“Risk comes from not knowing what you’re doing.” – Warren Buffett

Understanding how certain libraries and packages work is essential to saving time on projects, reducing hacky code, and keeping your momentum positive over time. A few months ago, I decided to use React, a JavaScript library, along with an Express server backend to make an online word game. Let's just say, it wasn't exactly the best experience one can have building a project. To start off, I did not know how any of those tools worked, and this project was my first exposure to them. I spent three months making what someone could make in a week or so, just because I constantly had to reformat my code and replan things. I didn't understand how React's state management worked or how the web sockets interacted with the Express server to emit messages between each other and in subgroups. Learning how these tools worked beforehand would surely have saved me a decent amount of time and stress, as well as enhanced the final product.

As a suggestion, I’d recommend making basic mini projects to test out these products before actually using them in the project you want to pursue. As an example, I’m currently interested in building a web application, and for that, I’ve decided to use an Express and DynamoDB backend. Both of these two elements are still a bit foreign to me, but I spent the past week (and will continue to do so this week) learning and testing them in small projects so that I really understand their quirks before actually using them in the web application I want to build.

Rule # 3: Stick to YAGNI (You Ain’t Gonna Need It)

“Always implement things when you actually need them, never when you just foresee that you need them.” – Ron Jeffries

YAGNI is a principle of extreme programming that states that programmers must not add a feature unless it is deemed necessary. Sometimes you may feel that adding something is needed, but after giving it thought, you realize that it isn’t necessary at all. Understand what’s crucial to your project and what’s not, and then build on that. At the end of the day, your project is meant to focus on what other people need, not what you want. If you satisfy your own needs, you’ve satisfied one person, but if you satisfy other people’s needs, you’ve satisfied a significant number of people. If you can satisfy both, then that’s awesome, but if not, then stick to the latter, as it obviously brings you more traffic and success.

Rule # 4: Don’t Jump the Gun Too Early

“Nothing good ever comes out of hurry and frustration, only misery.” – Auliq Ice

As with other processes, you want to be patient throughout this journey. A lot of beginners fall into the habit of wanting to go too big too soon and end up collapsing because they rushed everything. Set incremental goals and patiently continue to hit them until you’ve reached the benchmark you set out to achieve. Success isn’t something that happens overnight; it’s the result of many days of consistent hard work.

Rule # 5: It’s Okay To Ask For Help 🙂

“I think the hardest part to get to is that point of asking for help or reaching out to other people and being honest with yourself.” – Mary-Kate Olsen

When I first started building personal projects, I felt like being independent meant never asking anyone for help and figuring out solutions to my problems on my own. While that is somewhat true, I now understand that asking for help doesn’t reduce your independence at all but rather increases it, as it allows you to break through barriers and continue progressing in an upward trend.

No matter what you’re doing, there are always going to be experts who know the craft better than you do, so there’s no harm in asking them questions about how to do certain things. The catch here is that you don’t want to rush to them whenever you run into a problem. Take action on your own and do research. Only when you can’t find answers on your own should you turn to these professionals.

In terms of asking the right people, you want to be asking people who have knowledge of whatever stack you’re working with. If you’re working on something web-related, find a web developer who can help. If you’re working with machine learning, find someone who’s advanced in that and seek their help. In other words, take technical direction from someone who’s mastered what you’re dealing with.

Rule # 6: Don’t Get Stuck on Minutiae (Bikeshedding)

“Don’t let small things stop you from accomplishing bigger and better things.” – Sonya Parker

Bikeshedding occurs when you spend too much of your time and energy on trivial details of a larger concern. Instead, focus on the more important aspects first, and then, if you have time, clear up the remaining, insignificant features. An analogy I like to think of is standardized testing. On these exams, you’re being measured on your accuracy within a fixed time interval. If you get stuck on one problem and keep grinding away at it, you may not have enough time to finish the problems at the end, which may very well be easier than the one you got stuck on. The general rule of thumb is: if you get stuck on one problem, skip it, and if you have time later, come back and attempt it again. Similarly, you want to take advantage of the time you have to build your project. Don’t waste it on insignificant aspects of the project when more important features need your attention.

Final Overview

These are the most common rules that I’ve found myself violating this past year, as well as heard of or seen others violate, resulting in projects that failed at an early stage. If you have any other rules that you’d like to suggest, feel free to pitch in and post a comment for everyone to learn something from you. I am learning from you all just as you all are learning a thing or two from me.