Fastai Course Chapter 4 Q&A on WSL2
An answer key for the questionnaire at the end of the chapter

The 4th chapter of the textbook provides an overview of the training process. It gives a detailed introduction to measuring the loss, calculating the gradients, and updating the weights. It also covers some of the mechanics of the training process, which include tensor operations, activation functions, loss functions, optimizer functions, and the learning rate.
We’ve spent many weeks writing the questionnaires, and the reason for that is because we tried to think about what we wanted you to take away from each chapter. So if you read the questionnaire first, you can find out which things we think you should know before you move on, so please make sure to do the questionnaire before you move on to the next chapter.
— Jeremy Howard, Fast.ai
1. How is a grayscale image represented on a computer? How about a color image?
The Grayscale Image is an image with one channel that’s represented as a 2-dimensional matrix. It contains pixel values that represent the intensity of light for each pixel in the image where zero is a black pixel, 255 is a white pixel, and all the values in between are the different shades of gray pixels.
The Color Image is an image with three channels that’s represented as a 3-dimensional matrix. It contains three 2-dimensional matrices with pixel values that represent the intensity of color for each pixel in the image, where each of the matrices holds the different shades of red, green, or blue.
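As a quick sketch (the filenames here are hypothetical), loading images with PIL and NumPy shows the difference in the number of dimensions:
from PIL import Image
import numpy as np

gray = np.array(Image.open('digit.png').convert('L'))  # hypothetical file
print(gray.shape)   # e.g. (28, 28): one matrix of values from 0 to 255

color = np.array(Image.open('photo.jpg'))              # hypothetical file
print(color.shape)  # e.g. (480, 640, 3): red, green, and blue channels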
2. How are the files and folders in the MNIST_SAMPLE dataset structured? Why?
The dataset is structured using a common layout for machine learning datasets. It uses separate directories to store the training, validation, and/or test sets. It also uses separate subdirectories within each of those directories to store the image files, where the subdirectory names are used as the labels.
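A quick way to see this layout, using fastai’s sample dataset (output abbreviated):
from fastai.vision.all import *

path = untar_data(URLs.MNIST_SAMPLE)
print(path.ls())            # train, valid, labels.csv
print((path/'train').ls())  # 3, 7: the subdirectory names are the labels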
3. Explain how the “pixel similarity” approach to classifying digits works.
Pixel Similarity is an approach that’s used in machine learning to measure the similarity between two or more images. It computes the average pixel value for every pixel across all the images in each subdirectory of images. It also compares the unknown image to the average pixel values of the known images to determine how similar the image is to each of the known images.
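A minimal sketch, with synthetic tensors standing in for the real MNIST images:
import torch

stacked_threes = torch.rand(100, 28, 28)  # hypothetical stack of "3" images
mean3 = stacked_threes.mean(0)            # the "ideal" 3: average of every pixel

unknown = torch.rand(28, 28)              # a hypothetical unknown digit
l1_dist = (unknown - mean3).abs().mean()  # smaller distance = more similar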
4. What is a list comprehension? Create one now that selects odd numbers from a list and doubles them.
List Comprehension is a syntax that’s used in Python to create a list from an existing list. It creates the new list by performing an operation on each item in the existing list. It also contains three parts which include the expression, for-loop, and optional if-condition that’s declared between square brackets.
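For example, selecting the odd numbers from a list and doubling them:
numbers = [1, 2, 3, 4, 5, 6]
doubled_odds = [n * 2 for n in numbers if n % 2 == 1]
print(doubled_odds)  # [2, 6, 10]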
5. What is a rank-3 tensor?
Tensor Rank describes the number of dimensions in a tensor. It can have N dimensions where rank zero is a scalar with zero dimensions, rank one is a vector with one dimension, rank two is a matrix with two dimensions, and rank three is a cuboid with three dimensions. It can also be determined by the number of indices that are required to access a value within the tensor.
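For example, a rank-3 tensor in PyTorch, where three indices reach a single value:
import torch

t = torch.zeros(2, 3, 4)  # 2 matrices, each with 3 rows and 4 columns
print(t[0, 1, 2])         # three indices are needed to access one value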
6. What is the difference between tensor rank and shape? How do you get the rank from the shape?
The Tensor Shape describes the length of each axis in the tensor. It contains information about the rank, axes, and indices where the number of axes identifies the rank and the length of the dimensions identifies the number of indices that are available along each axis. It also helps visualize tensors which becomes useful for higher rank tensors that are much more abstract.
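The rank is just the length of the shape:
import torch

t = torch.zeros(2, 3, 4)
print(t.shape)       # torch.Size([2, 3, 4]): the length of each axis
print(len(t.shape))  # 3: the rank is the number of axes
print(t.ndim)        # 3: the same value, reported directly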
7. What are RMSE and L1 norm?
The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are loss functions that calculate the difference between the predicted values and the actual values. It would be better to use MAE, which is also known as the L1 norm, when the error is expected to scale linearly and when working with extreme values, because it penalizes outliers less heavily. It would be better to use RMSE, which is also known as the L2 norm, when the error is expected to scale non-linearly and larger errors should be penalized more.
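Both can be written in a line of PyTorch (F.l1_loss and F.mse_loss are the equivalent library calls):
import torch

predictions = torch.tensor([0.9, 0.4, 0.2])
actuals = torch.tensor([1.0, 0.0, 0.0])

mae = (predictions - actuals).abs().mean()          # L1 norm
rmse = ((predictions - actuals) ** 2).mean().sqrt() # L2 norm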
8. How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?
The NumPy Array is a multi-dimensional matrix that’s used to perform numeric computations. It can contain elements of any data type, as long as they’re all the same type, which can include arrays of arrays. Its operations also run in optimized C on the CPU, which performs computations thousands of times faster than pure Python.
The PyTorch Tensor is a specialized data structure that’s very similar to the NumPy array but with an additional restriction that unlocks additional capabilities. It can only contain elements that are of the same basic numeric data type. It also either runs on the CPU, which performs computations thousands of times faster than Python, or on the GPU, which performs computations up to millions of times faster than Python.
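A minimal comparison: the loop runs element by element in the interpreter, while the tensor operation is a single call into optimized C:
import torch

x = torch.rand(10_000)

total = 0.0
for value in x:             # slow: one interpreted iteration per element
    total += value * 2

total_fast = (x * 2).sum()  # fast: one vectorized operation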
9. Create a 3×3 tensor or array containing the numbers from 1 to 9. Double it. Select the bottom-right four numbers.
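One possible solution in PyTorch:
import torch

t = torch.arange(1, 10).view(3, 3)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
t = t * 2                           # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
print(t[1:, 1:])                    # bottom-right four: [[10, 12], [16, 18]]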
10. What is broadcasting?
Broadcasting is a concept in NumPy that’s used to describe the ability to perform operations on arrays with different shapes. It provides a way to vectorize the operations so the looping occurs in C, which can perform calculations thousands of times faster than Python. It also requires that each dimension of the arrays be equal in size, or that one of the dimensions be one.
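For example, adding a shape (4,) vector to a shape (3, 4) matrix broadcasts the vector across every row without copying it:
import torch

matrix = torch.ones(3, 4)  # shape (3, 4)
row = torch.arange(4)      # shape (4,)
print(matrix + row)        # shape (3, 4): the row is added to every row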
11. Are metrics generally calculated using the training set or the validation set? Why?
The model evaluation stage of the machine learning process uses metrics to evaluate the performance of the trained model using the validation set. It uses the metrics to detect overfitting and to tune the hyperparameters to improve the model’s performance. It also trains a new model with the best hyperparameters to evaluate the model’s performance using the test set.
12. What is SGD?
Stochastic Gradient Descent (SGD) is an algorithm in machine learning that’s used to find the model parameters that correspond to the best fit between the predicted values and the actual values. It calculates the gradient using random instances of the training data and updates the model parameters on each iteration which removes the computational burden associated with gradient descent. It can also adjust the model parameters in a way that moves the model out of a local minimum and towards the global minimum.
13. Why does SGD use mini-batches?
Optimization algorithms calculate the gradients using one or more data items. They can use the average of the whole dataset, but that takes a long time and may not fit into memory, or they can use a single data item, but that can be imprecise and unstable. They can also use the average over a mini-batch of a few data items, which becomes more accurate and stable as the batch size grows.
14. What are the seven steps in SGD for machine learning?
Imagine being lost in the mountains with your car parked at the lowest point. It would be good to always take steps downhill, which eventually leads to the destination. It would also be good to know how big a step to take and to continue taking steps until the bottom, the parking lot, is reached.
- Initialize the Random Parameters
- Calculate the Predictions
- Calculate the Loss
- Calculate the Gradients
- Update the Weights
- Go to Step Two and Repeat the Process
- Stop When the Model is Good Enough
15. How do we initialize the weights in a model?
The first step in training the model is to initialize the parameters, which are also referred to as the weights and biases. They can be initialized using random numbers, which works most of the time, except when training neural networks with many layers, where it can cause exploding or vanishing gradients. They can also be initialized using special weight initialization techniques, which still use random numbers but ensure the gradients stay within a reasonable range.
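A minimal sketch, similar to the init_params helper defined in the chapter:
import torch

def init_params(size, std=1.0):
    # random values, tagged so PyTorch tracks their gradients
    return (torch.randn(size) * std).requires_grad_()

weights = init_params((28 * 28, 1))
bias = init_params(1)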
16. What is loss?
Loss is an evaluation metric that’s used in machine learning to measure how wrong the predictions are. It calculates the distance between the predicted values and the actual values where zero represents a perfect score. It also gets calculated using one of several different loss functions that vary based on whether the model is solving a classification or a regression problem.
17. Why can’t we always use a high learning rate?
Learning Rate is a hyperparameter that’s used in machine learning to control how much to adjust the weights at each iteration of the training process. It can be too low, which makes training take too long and makes the model more likely to get stuck in a local minimum. It can also be too high, which overshoots the global minimum and bounces around without ever reaching it.

18. What is a gradient?
The Gradient is a vector that’s used in machine learning to identify the direction in which the loss function produces the steepest ascent. It measures the change in the loss with respect to a change in each of the weights. It also gets used to update the weights during the training process, where the product of the gradient and the learning rate is subtracted from the weights.
19. Do you need to know how to calculate gradients yourself?
No, it’s not necessary to know how to manually calculate gradients. They can be calculated automatically with respect to the associated variable using the requires_grad_ method of the Tensor class from the PyTorch library. It tags the variable so that PyTorch keeps track of every operation that’s applied to the tensor, which lets backpropagation calculate the gradients.
variable_name = torch.tensor(3.).requires_grad_()
20. Why can’t we use accuracy as a loss function?
Accuracy isn’t good to use as a loss function because it only changes when the predictions of the model change. The model can become more confident in its predictions, but unless the predictions themselves change, the accuracy will remain the same. Accuracy also produces gradients that are mostly equal to zero, which prevents the parameters from updating during the training process.
21. Draw the sigmoid function. What is special about its shape?
The sigmoid function is an activation function that’s named after its shape, which resembles the letter “S” when plotted. It has a smooth curve that gradually transitions from values just above 0.0 to values just below 1.0. It also only goes up, which makes it easier for SGD to find meaningful gradients.
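A quick way to draw it (assuming matplotlib is installed):
import torch
import matplotlib.pyplot as plt

x = torch.linspace(-6, 6, 100)
plt.plot(x, torch.sigmoid(x))  # a smooth "S" from just above 0 to just below 1
plt.title('sigmoid')
plt.show()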

22. What is the difference between a loss function and a metric?
The loss function is used to evaluate and diagnose how well the model is learning during the optimization step of the training process. It responds to small changes in confidence levels which helps to minimize the loss and monitor for things like overfitting, underfitting, and convergence. It also gets calculated for each item in the dataset, and at the end of each epoch where the loss values are all averaged and the overall mean is reported.
The metric is used to evaluate the model and perform model selection during the evaluation process after the training process. It provides an interpretation of the performance of the model that’s easier for humans to understand which helps give meaning to the performance in the context of the goals of the overall project and project stakeholders. It also gets printed at the end of each epoch which reports the performance of the model.
23. What is the function to calculate new weights using a learning rate?
The Optimizer is an optimization algorithm that’s used in machine learning to update the weights based on the gradients during the optimization step of the training process. The basic update, called the step, subtracts the product of the learning rate and the gradient from each weight: weights -= learning_rate * weights.grad. It can also make the difference between getting a good accuracy in hours or days.
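A minimal sketch of that step, assuming params is a list of tensors whose grad attributes were populated by loss.backward():
def step(params, learning_rate):
    for p in params:
        p.data -= learning_rate * p.grad.data  # move against the gradient
        p.grad = None                          # reset for the next batch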
24. What does the DataLoader class do?
The DataLoader is a class that’s used in PyTorch to preprocess the dataset into the format that’s expected by the model. It specifies the dataset to load, randomly shuffles the dataset, creates the mini-batches, and loads the mini-batches in parallel. It also returns a dataloader object that contains tuples of tensors that represent the batches of independent and dependent variables.
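A minimal sketch with a toy dataset (a list of (item, label) tuples works as a dataset):
from torch.utils.data import DataLoader

dataset = list(zip(range(10), 'abcdefghij'))  # hypothetical toy dataset
dl = DataLoader(dataset, batch_size=4, shuffle=True)
for items, labels in dl:
    print(items, labels)  # a tensor of items and a tuple of labels per batch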
25. Write pseudocode showing the basic steps taken in each epoch for SGD.
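One sketch of a single epoch, with placeholder names:
for x_batch, y_batch in train_dataloader:
    predictions = model(x_batch)                # 1. make predictions
    loss = loss_function(predictions, y_batch)  # 2. measure the loss
    loss.backward()                             # 3. calculate the gradients
    for p in parameters:
        p.data -= learning_rate * p.grad.data   # 4. update the weights
        p.grad = None                           # 5. reset the gradients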
26. Create a function that, if passed two arguments [1,2,3,4] and ‘abcd’, returns [(1, ‘a’), (2, ‘b’), (3, ‘c’), (4, ‘d’)]. What is special about that output data structure?
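One way to write it:
def pair(items, labels):
    # zip pairs the elements at each index; list(...) materializes the tuples
    return list(zip(items, labels))

print(pair([1, 2, 3, 4], 'abcd'))  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]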
The output is special because it has the same data structure as the Dataset object that’s used in PyTorch. It contains a list of tuples where each tuple stores an item with the associated label. It also contains all the items and labels from the first and second parameters which are paired at each index.
27. What does view do in PyTorch?
The View is a method that’s used in PyTorch to reshape a tensor without changing its contents. It doesn’t create a copy of the data, which allows for memory-efficient reshaping, slicing, and element-wise operations. It also shares the underlying data with the original tensor, which means any changes made to the data in the view will be reflected in the original tensor.
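For example, reshaping 12 values into a 3×4 matrix that shares storage with the original:
import torch

t = torch.arange(12)  # shape (12,)
m = t.view(3, 4)      # same 12 values, seen as 3 rows x 4 columns
m[0, 0] = 99          # the view shares memory with the original tensor
print(t[0])           # tensor(99)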
28. What are the bias parameters in a neural network? Why do we need them?
The Bias is a parameter that’s used in machine learning to offset the output inside the model to better fit the data during the training process. It shifts the activation function to the left or right which moves the entire curve to delay or accelerate the activation. It also gets added to the product of the inputs and weights before being passed through the activation function.
output = sum(inputs * weights) + bias
29. What does the @ operator do in Python?
The @ is an operator that’s used in Python to perform matrix multiplication between two arrays. It performs the same operation as the matmul function from the NumPy library. It also makes matrix formulas much easier to read which makes it much easier to work with for both experts and non-experts.
np.matmul(np.matmul(np.matmul(A, B), C), D)  # without the @ operator
A @ B @ C @ D                                # with the @ operator
30. What does the backward method do?
Backward is a method that’s used in PyTorch to calculate the gradients of the loss. It performs backpropagation through every operation that was recorded on the tensor by the requires_grad_ tag. It also adds the gradients to any other gradients that are currently stored in the grad attribute of the tensor object.
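A minimal example:
import torch

x = torch.tensor(3.0).requires_grad_()
loss = x ** 2
loss.backward()  # computes d(loss)/dx = 2x and stores it in x.grad
print(x.grad)    # tensor(6.)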
31. Why do we have to zero the gradients?
In PyTorch, the gradients accumulate across subsequent backward passes by default. That behavior helps train recurrent neural networks on time-series data, where backpropagation is repeated to perform backpropagation through time. For most neural networks, though, the gradients must be manually reset to zero after each weight update so the parameters get updated correctly on the next pass.
learning_rate = 1e-5
parameters.data -= learning_rate * parameters.grad.data  # update the weights
parameters.grad = None                                   # reset the gradients
32. What information do we have to pass to Learner?
The Learner is a class that’s used in Fastai to train the model. It specifies the data loaders and model objects that are required to train the model and perform transfer learning. It can also specify the optimizer function, loss function, and other optional parameters that already have default values.
learner = Learner(dataloaders, model, loss_func=loss_function, opt_func=optimizer_function, metrics=metrics)
33. Show Python or pseudocode for the basic steps of a training loop.
Training is a process in machine learning that’s used to build a model that can make accurate predictions on unseen data. It involves an architecture, dataset, hyperparameters, loss function, and optimizer. It also involves splitting the dataset into training, validation, and testing data, making predictions about the data, calculating the loss, and updating the weights.
for _ in range(epochs):
    predictions = model(x_batch, parameters)      # 1. make predictions
    loss = loss_function(predictions, labels)     # 2. measure the loss
    loss.backward()                               # 3. calculate the gradients
    for parameter in parameters:
        parameter.data -= learning_rate * parameter.grad.data  # 4. update
        parameter.grad = None                                   # 5. reset
34. What is ReLU? Draw a plot of it for values from -2 to +2.
Rectified Linear Unit (ReLU) is an activation function that’s used in machine learning to address the vanishing gradient problem. It passes all the positive input values through unchanged and replaces all the negative values with zero. It can also decrease the ability of the model to train properly when too many activations are zero, because the gradient of zero is zero, which prevents those parameters from being updated during the backward pass.
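A quick way to draw it over the requested range (assuming matplotlib is installed):
import torch
import matplotlib.pyplot as plt

x = torch.linspace(-2, 2, 100)
plt.plot(x, torch.relu(x))  # flat at zero for x < 0, then the line y = x
plt.title('ReLU')
plt.show()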

35. What is an activation function?
The Activation Function is a function that’s used in machine learning to decide whether a neuron’s input is relevant or irrelevant. It gets attached to each neuron in the artificial neural network and determines whether to activate based on whether the input is relevant for the model’s prediction. It also introduces the nonlinearity, and some activation functions, such as tanh, squash each neuron’s output into a bounded range like -1 to 1.
output = activation_function(sum(inputs * weights) + bias)
36. What’s the difference between F.relu and nn.ReLU?
F.relu is a function that’s used in PyTorch to apply the rectified linear unit inside a model that’s manually defined as a class. It gets called in the forward method of the artificial neural network, where the layers and functions are defined as class attributes. It does the same thing as the nn.ReLU class.
nn.ReLU is a module that’s used in PyTorch to apply the rectified linear unit in a model that’s built from sequential modules. It gets placed alongside the other sequential modules that represent the layers and functions of the artificial neural network. It does the same thing as the F.relu function, as the sketch below shows.
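A minimal sketch of the two styles side by side:
import torch.nn as nn
import torch.nn.functional as F

# sequential style: nn.ReLU is a module placed between the layers
sequential_model = nn.Sequential(nn.Linear(28 * 28, 30), nn.ReLU(), nn.Linear(30, 1))

# subclass style: F.relu is called inside a manually defined forward method
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 30)
        self.l2 = nn.Linear(30, 1)
    def forward(self, x):
        return self.l2(F.relu(self.l1(x)))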
37. The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?
An artificial neural network with two layers and a nonlinear activation function can approximate any function but there are performance benefits for using more layers. It turns out that smaller matrices with more layers perform better than large matrices with fewer layers. It also means the model will train faster, use fewer parameters, and take up less memory.
Next Steps:
This article is part of a series that helps you set up everything you need to complete the Fast.ai course from start to finish. It contains guides that provide answers to the questionnaire at the end of each chapter from the textbook. It also contains guides that walk through the code step-by-step using definitions of terms and commands, instructions, and screenshots.
WSL2:
01. Install the Fastai Requirements
02. Fastai Course Chapter 1 Q&A
03. Fastai Course Chapter 1
04. Fastai Course Chapter 2 Q&A
05. Fastai Course Chapter 2
06. Fastai Course Chapter 3 Q&A
07. Fastai Course Chapter 3
08. Fastai Course Chapter 4 Q&A
Additional Resources:
This article is part of a series that helps you set up everything you need to start using artificial intelligence, machine learning, and deep learning. It contains expanded guides that provide definitions of terms and commands to help you learn what’s happening. It also contains condensed guides that provide instructions and screenshots to help you get the outcome faster.
Linux:
01. Install and Manage Multiple Python Versions
02. Install the NVIDIA CUDA Driver, Toolkit, cuDNN, and TensorRT
03. Install the Jupyter Notebook Server
04. Install Virtual Environments in Jupyter Notebook
05. Install the Python Environment for AI and Machine Learning
WSL2:
01. Install Windows Subsystem for Linux 2
02. Install and Manage Multiple Python Versions
03. Install the NVIDIA CUDA Driver, Toolkit, cuDNN, and TensorRT
04. Install the Jupyter Notebook Server
05. Install Virtual Environments in Jupyter Notebook
06. Install the Python Environment for AI and Machine Learning
07. Install Ubuntu Desktop With a Graphical User Interface (Bonus)
Windows 10:
01. Install and Manage Multiple Python Versions
02. Install the NVIDIA CUDA Driver, Toolkit, cuDNN, and TensorRT
03. Install the Jupyter Notebook Server
04. Install Virtual Environments in Jupyter Notebook
05. Install the Python Environment for AI and Machine Learning
Mac:
01. Install and Manage Multiple Python Versions
02. Install the Jupyter Notebook Server
03. Install Virtual Environments in Jupyter Notebook
04. Install the Python Environment for AI and Machine Learning
Glossary:
Mean Absolute Error (MAE) is a loss function that’s used to measure the performance of the model. It computes the average of the absolute value of the differences between the predicted values and the actual values. It also should produce similar scores for the training and test sets where lower scores indicate a better fit and larger gaps in the scores indicate overfitting.

Root Mean Square Error (RMSE) is a loss function that’s used to measure the performance of the model. It computes the square root of the average of the squared differences between the predicted values and the actual values. It also should have similar scores for the training and test sets where a lower score indicates a better fit and larger gaps in the scores indicate overfitting.

Classification Accuracy (Accuracy) is an evaluation metric that’s used in machine learning to measure how often the model is correct. It can be calculated by dividing the number of correct predictions, which includes true positives and true negatives, by the total number of predictions, which includes true positives, true negatives, false positives, and false negatives.
Loss Function is a function that’s used in machine learning to evaluate how well the model is performing. It calculates the loss which changes as the parameters are adjusted to produce a slightly better loss when the model makes slightly better predictions. It also gets used to calculate the gradient which is necessary to update the parameters during the training process.
Sigmoid is an activation function that’s used to predict probability in binary and multi-label classification problems. It converts input values into outputs between 0.0 and 1.0, where large positive numbers approach 1.0 and large negative numbers approach 0.0. It also predicts each probability separately, with high accuracy on non-mutually exclusive outputs, but it can cause vanishing gradients.