Hi everyone! I’m taking an online deep learning with PyTorch course, which has turned out to be a really enjoyable experience. For our second assignment, the core focus was on building a Linear Regression model that predicts insurance charges. I finished that assignment, but to spice things up a bit, I decided to try building a similar model on a different dataset, one that predicts suicide rates instead. Since the objective of the assignment was to fully understand Linear Regression and how to build a model of that type, I stuck with the same approach for this optional project as well. The Jupyter Notebook I’ve written can be found below as well as on GitHub here
Predicting Suicide Rates Using Linear Regression¶
In this notebook, we’ll be using features such as country, gender, age group, population count, GDP values, and more to predict the suicide rate for a specific group of people in a given year. A model like this could be useful to suicide-prevention organizations and mental health hotlines, both for tracking how this problem grows over time and for monitoring their progress towards reducing suicide rates. The dataset for this problem is taken from: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
We will create a model with the following steps:
- Downloading and Cleaning the Data
- Exploring the Data
- Preparing the Dataset for Training
- Creating a Linear Regression Model
- Training the Model to Fit the Data
- Evaluating the Model on the Test Data
!pip install jovian --upgrade --quiet
import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
# Configuring styles
sns.set_style("darkgrid")
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
project_name='02-custom-sucide-linear-regression' # will be used by jovian.commit
Step 1: Downloading and Cleaning the Data¶
To load the dataset into memory, we’ll make use of the read_csv function from the Pandas library, which reads the data in as a Pandas dataframe.
dataframe = pd.read_csv('/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv')
To clean the data, we first remove unnecessary columns such as suicides_no, country-year, and generation, since these columns are simple transformations of other provided columns and do not add any new information. We then rename the gdp_for_year (\$) column, since the original dataset includes extra whitespace in the column name. Lastly, we convert the gdp_for_year (\$) column to an integer by stripping the commas out of the string and casting the values to type integer.
dataframe = dataframe.drop(['suicides_no', 'country-year', 'generation'], axis=1) # dropping duplicate columns that aren't necessary
dataframe.rename(columns = {' gdp_for_year ($) ':'gdp_for_year ($)'}, inplace = True) # renaming column to remove extra whitespace
dataframe['gdp_for_year ($)'] = dataframe['gdp_for_year ($)'].str.replace(',', '').astype('int') # converting data type to integer
dataframe = dataframe.fillna(dataframe.mean()) # filling missing numeric values (mainly the 'HDI for year' column) with the column mean
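As a quick sanity check, we can confirm that no missing values remain after this step:
dataframe.isna().sum()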
Step 2: Exploring the Data¶
Let’s first see what our data looks like as a starting step.
dataframe.head()
Let’s also see what data types we’re working with to confirm that everything’s fine.
dataframe.dtypes
Now, let’s answer some basic questions about our dataset.
Q: How many rows are there in the dataset?¶
num_rows = dataframe.shape[0]
num_rows
Q: How many columns are there in the dataset?¶
num_cols = dataframe.shape[1]
num_cols
Q: What are the names of our input features?¶
input_cols = [col for col in dataframe.columns]
input_cols.remove('suicides/100k pop')
input_cols
Q: What are the names of our categorical features?¶
categorical_cols = [col for col in dataframe.select_dtypes(exclude=['number']).columns]
categorical_cols
Q: What is the label that we are predicting?¶
output_cols = ['suicides/100k pop']
output_cols
Q: What does the distribution of label values look like?¶
plt.title("Distribution of Suicides per 100k population")
sns.distplot(dataframe['suicides/100k pop']);
Q: What are the minimum, maximum, and average suicide rates?¶
print("Minimum Suicides per 100k Population: {}".format(dataframe['suicides/100k pop'].min()))
print("Maximum Suicides per 100k Population: {}".format(dataframe['suicides/100k pop'].max()))
print("Average Suicides per 100k Population: {}".format(dataframe['suicides/100k pop'].mean()))
jovian.commit(project=project_name, environment=None)
Step 3: Preparing the Dataset for Training¶
We’ll first begin by converting our Pandas dataframe to a pair of numpy arrays. The numeric features can be converted directly. Categorical features, however, first need to be transferred to a numeric representation. Here, we do that by converting each categorical column to Pandas’ category dtype and replacing each value with its integer category code, which keeps exactly one column per feature so that input_cols still lines up with the resulting array. (One-hot “dummy” encoding is a common alternative, but it expands each categorical column into several binary columns, which would change the input size used later.)
def dataframe_to_arrays(dataframe):
# Make a copy of the original dataframe
temp = dataframe.copy(deep=True)
# Convert non-numeric categorical columns to numbers
for col in categorical_cols:
        temp[col] = temp[col].astype('category').cat.codes # replace each category with its integer code
# Extract input & outputs as numpy arrays
inputs_array = temp[input_cols].to_numpy()
targets_array = temp[output_cols].to_numpy()
return inputs_array, targets_array
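For intuition, here’s a tiny standalone example (with made-up values) of what the category-to-code mapping above does; the codes follow the sorted order of the categories:
pd.Series(['male', 'female', 'male']).astype('category').cat.codes.tolist() # [1, 0, 1]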
inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array
Now that we’ve converted our inputs and targets to numpy arrays, it’s time to convert them to PyTorch tensors.
inputs = torch.from_numpy(inputs_array).type(torch.float32)
targets = torch.from_numpy(targets_array).type(torch.float32)
Let’s check the types to make sure that they’re of type float32.
inputs.dtype, targets.dtype
Next, we’ll create our dataset and dataloaders that are needed for training and testing.
dataset = TensorDataset(inputs, targets)
Here, we set the percentage of the data assigned to each of the training, validation, and testing sets. We then split the dataset randomly, so that the order of the original data does not have any impact on the results.
val_percent = 0.15
test_percent = 0.15
val_size = int(num_rows * val_percent)
test_size = int(num_rows * test_percent)
train_size = num_rows - val_size - test_size
train_ds, val_ds, test_ds = random_split(dataset, [train_size, val_size, test_size]) # randomly splitting the dataset into train, validation, and test datasets
Batch size is an important property of a network’s training setup. The larger the batch size, the closer each mini-batch gradient gets to the gradient computed over the entire dataset (i.e. full-batch gradient descent). The reason we operate on mini-batches is speed: each update becomes cheap, at the cost of a noisier gradient estimate. Hence, the larger the batch size, the more accurate each gradient step becomes, but the fewer and slower the updates; smaller batches trade accuracy for faster learning.
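To make this concrete, here’s a minimal, self-contained sketch on toy data (not our suicide dataset) showing that gradients computed on larger batches land closer to the full-data gradient:

import torch

torch.manual_seed(0)
X = torch.randn(1000, 3)                # toy inputs
y = X @ torch.tensor([1.0, -2.0, 3.0])  # toy targets
w = torch.zeros(3, requires_grad=True)  # parameters of a toy linear model

def grad_for(batch_x, batch_y):
    # gradient of the mean squared error on this batch w.r.t. w
    loss = ((batch_x @ w - batch_y) ** 2).mean()
    g, = torch.autograd.grad(loss, w)
    return g

full_grad = grad_for(X, y)
for bs in (8, 64, 512):
    idx = torch.randperm(len(X))[:bs]   # a random mini-batch of size bs
    gap = torch.norm(grad_for(X[idx], y[idx]) - full_grad)
    print(f"batch size {bs:4d}: distance from full-data gradient = {gap:.3f}")
# the distance generally shrinks as the batch size grows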
batch_size = 32
In this step, the dataloaders are created. It’s important to understand why the training loader is shuffled while the validation and test loaders aren’t. We shuffle the training data because we don’t want the model to learn anything from the order in which examples are fed to it; a fixed order could bias the model’s understanding of the data in a very specific direction, which is clearly something we don’t want. So why don’t we shuffle the other loaders? When validating and testing, we don’t care about the order of the examples we evaluate against; all that matters is how the model performs, and the averaged loss is the same regardless of the order of the data.
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
test_loader = DataLoader(test_ds, batch_size)
Let’s take a quick look at what one of our batches looks like.
for xb, yb in train_loader:
print("inputs:", xb)
print("targets:", yb)
break
jovian.commit(project=project_name, environment=None)
Step 4: Creating a Linear Regression Model¶
Input size and output size are quite boring in that they’re already predetermined by the dataset and by which features and labels you choose to use. What’s really interesting is the size of the hidden layer (i.e. how many nodes the neural network has at a specific hidden layer). There are many theories on how to choose the right size for a hidden layer, and more importantly for the network as a whole, but at the end of the day most of this comes down to systematic experimentation and seeing what works best. A common rule of thumb is to pick a hidden size somewhere between the input and output sizes, but sometimes that isn’t the best choice for the model. That being said, I’ve experimented with this and found that setting the hidden size to the average of the input and output sizes worked well enough for this model.
input_size = len(input_cols)
output_size = len(output_cols)
hidden_size = (input_size + output_size) // 2
input_size, hidden_size, output_size
In this model, I create one hidden layer between the input and output layers. I also introduce the ReLU function (f(x) = max(0, x)), which introduces nonlinearity to the model. Without ReLU, this model could at best capture a linear trend in the data, which isn’t always what the data follows. As for the loss function, I’ve chosen the L1 loss over the mean squared error (MSE) loss, mainly because L1 loss performs much better when there are a lot of outliers in the data. Because MSE squares each error, a single large outlier can dominate and blow up the loss, whereas L1 loss only grows linearly with the size of the error.
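As a quick illustration of that point, here’s a small sketch (with made-up numbers) comparing the two losses when a single outlier is present:

preds = torch.tensor([10.0, 12.0, 11.0, 10.5])
targets = torch.tensor([10.0, 12.0, 11.0, 60.0])  # the last target is an outlier
print(F.l1_loss(preds, targets))   # mean absolute error: (0 + 0 + 0 + 49.5) / 4 ≈ 12.4
print(F.mse_loss(preds, targets))  # mean squared error: 49.5**2 / 4 ≈ 612.6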
class SuicideRateModel(nn.Module):
def __init__(self):
super().__init__()
self.linear1 = nn.Linear(input_size, hidden_size) # hidden layer
self.linear2 = nn.Linear(hidden_size, output_size) # output layer
def forward(self, xb):
out = self.linear1(xb)
out = F.relu(out)
out = self.linear2(out)
return out
def training_step(self, batch):
inputs, targets = batch
out = self(inputs)
loss = F.l1_loss(out, targets)
return loss
def validation_step(self, batch):
inputs, targets = batch
out = self(inputs)
loss = F.l1_loss(out, targets)
return {'val_loss': loss.detach()}
def validation_epoch_end(self, outputs):
batch_losses = [x['val_loss'] for x in outputs]
epoch_loss = torch.stack(batch_losses).mean()
return {'val_loss': epoch_loss.item()}
def epoch_end(self, epoch, result, num_epochs):
if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:
print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
Let’s now create a model using the SuicideRateModel class.
model = SuicideRateModel()
We can immediately see that PyTorch has, by default, initialized our model parameters (the weights and biases) to random values. These will be adjusted during training as our model learns more about the data.
list(model.parameters())
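We can also confirm that the parameter shapes match the layer sizes we chose:

for name, param in model.named_parameters():
    print(name, param.shape)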
jovian.commit(project=project_name, environment=None)
Step 5: Training the Model to Fit the Data¶
We’ll be using the fit function to train the model and the evaluate function to check how well our model is performing at each step.
def evaluate(model, val_loader):
outputs = [model.validation_step(batch) for batch in val_loader]
return model.validation_epoch_end(outputs)
def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
history = []
training_history = []
optimizer = opt_func(model.parameters(), lr)
for epoch in range(epochs):
for batch in train_loader:
loss = model.training_step(batch)
loss.backward()
optimizer.step()
optimizer.zero_grad()
training_history.append(evaluate(model, train_loader))
result = evaluate(model, val_loader)
model.epoch_end(epoch, result, epochs)
history.append(result)
return history, training_history
Let’s quickly use the evaluate function to test our model with the randomly initialized parameters against the validation set before performing any training steps.
result = evaluate(model, val_loader)
print(result)
We’re now ready to train our model. This process requires a significant amount of experimentation in terms of tuning the learning rate and the number of epochs. There is some quite deep mathematical theory behind the values that get chosen, and it requires an understanding of gradient descent and calculus.

The simplified idea is this: our goal is to minimize the model’s loss function, which is differentiable with respect to the model’s parameters (i.e. the weights and biases). Consider the loss at the current parameter values. If the slope with respect to a parameter is positive, then the loss increases as that parameter increases. That means that to decrease the loss, we need to decrease the parameter by subtracting a small proportion of the positive slope value from it. Similarly, if the slope is negative, then the loss decreases as the parameter increases. This means that to decrease the loss further, we need to increase the parameter by subtracting (yes, subtracting!) a small proportion of the negative slope value from it.

This small proportion is what’s known as the learning rate. A large learning rate allows the model to learn really fast, whereas a smaller learning rate forces the model to learn slowly. However, each option comes with its own drawbacks. High learning rates can cause the model to overshoot and miss the global minimum, whereas small learning rates can cause the model to undershoot and get stuck at a local minimum that may not be optimal. Hence, the most logical strategy is usually to start with a high learning rate and then lower it over time, so that the model approaches the global minimum quickly at first and then slows down as it gets closer, so as not to overshoot it.
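To make the update rule concrete, here’s a minimal sketch of a single gradient descent step on a toy one-parameter loss (not our actual model):

w = torch.tensor([2.0], requires_grad=True)
loss = (w - 5.0) ** 2  # toy loss, minimized at w = 5
loss.backward()        # dloss/dw = 2 * (w - 5) = -6 at w = 2
lr = 0.1
with torch.no_grad():
    w -= lr * w.grad   # subtracting a negative slope *increases* w towards 5
print(w)               # tensor([2.6000], requires_grad=True)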
epochs = 5
lr = 1e-1
history1, training_history = fit(epochs, lr, model, train_loader, val_loader)
val_losses = []
training_losses = []
for obj in history1:
val_losses.append(obj['val_loss'])
for obj in training_history:
training_losses.append(obj['val_loss'])
plt.plot(val_losses, label='validation loss')
plt.plot(training_losses, label='training loss')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show();
From this graph, we can see that both our validation and training losses have dropped quite drastically in just 5 epochs. Let’s see if we can decrease them further with a smaller learning rate.
epochs = 2
lr = 1e-6
history2, training_history = fit(epochs, lr, model, train_loader, val_loader)
val_losses = []
training_losses = []
for obj in history2:
val_losses.append(obj['val_loss'])
for obj in training_history:
training_losses.append(obj['val_loss'])
plt.plot(val_losses, label='validation loss')
plt.plot(training_losses, label='training loss')
plt.xlabel('Number of Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show();
It’s quite apparent from the graph above that the losses have stalled and flattened out. This essentially tells us that our model has converged to what it thinks is the optimal solution. This judgment can sometimes be off depending on the choices made for the hyperparameters discussed earlier, but I’ve tested this model with multiple hyperparameter settings and the loss flattens out at this point regardless, so this stall in the loss is probably caused by the fact that our dataset is relatively small.
Step 6: Evaluating the Model on the Test Data¶
Let’s go right ahead and test our model on the testing data to make sure that our testing loss is close enough to our training loss. If our testing loss is much larger than our training loss, our model has likely overfit the training data: it has learned so many specific details of the training set that it cannot generalize to new, unseen data. (Underfitting, by contrast, would show up as a high loss on both the training and testing data, since the model hasn’t learned enough about the data to make good predictions at all.) If our testing loss is much lower than our training loss, then either we made some mistake in our model construction or our testing data just contains much easier examples than our training data. But here’s the catch! Since random_split partitions the data differently on each run (unless a random seed is fixed), rerunning the notebook produces a different test set each time, so a run of luck with easy test examples shouldn’t repeat itself. If you do run multiple times and the testing loss always comes out much lower than the training loss, then there is a high likelihood that there’s a logical error somewhere down the path of model construction.
result = evaluate(model, test_loader)
result
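For a direct comparison, we can also compute the final loss on the training set with the same evaluate helper:

evaluate(model, train_loader) # final training loss, to compare against the test loss above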
Here, we can see that our testing loss is very close to our training loss, and thus we have neither underfit nor overfit the training data. It looks like our model has performed just as we expected based on the output of the training steps above.
jovian.commit(project=project_name, environment=None)
Sources
- Featured image can be found here.