AI model training with PyTorch

The field of generative AI continues to grow rapidly and holds substantial transformative potential for the enterprise. The watsonx platform harnesses this growth to accelerate the AI lifecycle in all phases. More specifically, watsonx.ai handles the training, validation, tuning, and deployment of machine learning (ML) models with ease in a secure studio environment.

There are several projects that build the open source stack of watsonx, one of which is PyTorch. At IBM, we use PyTorch to train foundation models, which are large-scale ML models trained on a vast and broad set of unlabeled data. Foundation models can perform many different functions and look to replace the task-specific models that have been the focus of the machine learning landscape until recently.

In this article, we'll discuss the basics of PyTorch and its support for the training, evaluation, and inferencing phases of the machine learning workflow. For a deeper dive and to play with an example of how to build, train, and evaluate an ML model, refer to this companion guided project.

What is PyTorch?

PyTorch is an open source framework for AI research and commercial production in machine learning. It is used to build, train, and optimize deep learning neural networks for applications such as image recognition, natural language processing, and speech recognition. It provides computation support for CPU, GPU, parallel and distributed training on multiple GPUs and multiple nodes. PyTorch is Pythonic, making it easy for data scientists and developers to build and debug complex machine learning workflows.

PyTorch is also flexible and easily extensible, with specific libraries and tools available for many different domains. Together, these qualities have made PyTorch a leading framework in machine learning. PyTorch has grown tremendously since its initial release, but two features remain at its core: tensors and an automatic differentiation library called autograd. We'll introduce tensors and their importance to neural networks, then cover the machine learning phases, and in particular how autograd is used to improve the model training process.

For the purposes of this article, the words "neural network" and "model" can be used interchangeably. In other machine learning domains, however, models may refer to something other than neural networks.

Tensors are a core data structure used in PyTorch and other deep learning projects and can be thought of as multidimensional arrays. They are used to encode the inputs and outputs of a model as well as all the model parameters in between. Unlike NumPy arrays, which they otherwise resemble, tensors support operations that can be accelerated by GPUs or other hardware. There are over one hundred ready-to-use built-in operations that can be applied to tensors, from simple addition to more complicated operations such as matrix multiplication. In addition to storing values, tensors also store a good deal of metadata that informs all phases of model development. Tensor metadata attributes include the shape of the tensor, the data type of the values stored within, the device on which the tensor is stored, and the layout of the tensor in memory. A tensor's attributes optimize it for use with autograd, PyTorch's automatic differentiation engine, which underpins the process by which a neural network learns. (The tensors tutorial in the PyTorch docs provides a more in-depth look at tensors.)
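
As a rough illustration of the operations and metadata described above, the following sketch builds two small tensors, applies addition and matrix multiplication, and prints the shape, data type, device, and layout attributes. The variable names and values are made up for this example, and the last lines simply fall back to the CPU when no GPU is available.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.ones(2, 2)

# Two of the built-in tensor operations
total = a + b      # element-wise addition
product = a @ b    # matrix multiplication
print(total)
print(product)

# Metadata stored alongside the values
print(a.shape)     # size of each dimension
print(a.dtype)     # data type of the stored values
print(a.device)    # device on which the tensor is stored
print(a.layout)    # memory layout of the tensor

# Move the tensor to a GPU if one is available, otherwise keep it on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
a_on_device = a.to(device)
print(a_on_device.device)
```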

The PyTorch community is extensive and committed to growing the project and to delivering state-of-the-art features as the world of AI continues to expand, making PyTorch a leading framework for ML models of all sizes.

Model training

A neural network can be thought of as a sequence of nested functions, with each function loosely representing a layer of the network. Each layer accepts input in the form of a tensor, modifies that input according to its rules and the values stored in its parameter tensor, and passes the resulting output tensor to the next layer.

The following example shows how a neural network is defined in PyTorch. The layers of the network, including their input and output sizes, are defined in the __init__ method, and the forward method applies the defined layers step by step to the input tensor (input_data) and returns a prediction.

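import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers are declared here, along with their input and output sizes
        self.linear1 = nn.Linear(100, 200)  # 100 input features -> 200 hidden units
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(200, 10)   # 200 hidden units -> 10 outputs
        self.softmax = nn.Softmax(dim=-1)   # normalize outputs along the last dimension

    def forward(self, input_data):
        # Apply the layers in order and return the prediction
        x = self.linear1(input_data)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x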

The network is then instantiated, and a prediction can be made by supplying input data to the model object. Note that the forward method is called implicitly.

model = NeuralNetwork()

input_tensor = torch.randn(10, 100)
output = model(input_data=input_tensor)  # use of keyword arguments, as shown here, is not required

For the purposes of this simple example, we are using random tensor values as input, but the training data used in a real-life scenario is naturally much more complex. PyTorch provides two data primitives for working with data: DataLoader and Dataset. The Dataset class stores the input and label data, which can be either pre-loaded from a library of well-known datasets or a custom dataset of your own. The DataLoader class enables easy access to the Dataset by wrapping it with an iterable. The companion guided project provides an in-depth look at the use of these primitives as it relates to model training.
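
A minimal sketch of how these two primitives fit together might look like the following. The RandomDataset class, its sample count, and the batch size are hypothetical and chosen only to match the input and output sizes of the network above; a real dataset would load actual inputs and labels instead of random values.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    """Hypothetical custom dataset holding random inputs and targets."""

    def __init__(self, num_samples=64):
        self.inputs = torch.randn(num_samples, 100)
        self.targets = torch.randn(num_samples, 10)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.inputs)

    def __getitem__(self, idx):
        # Return one (input, target) pair
        return self.inputs[idx], self.targets[idx]

dataset = RandomDataset()

# The DataLoader wraps the dataset in an iterable that yields shuffled batches
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_inputs, batch_targets in loader:
    print(batch_inputs.shape, batch_targets.shape)
```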

Two phases of model training: the forward pass and the backward pass

The last line in the above snippet represents what is called a forward pass in model training. There are two primary phases of model training: the forward pass and the backward pass. These two phases are repeated iteratively until the model is considered sufficiently trained. In the forward pass, the model makes a prediction for inputs based on the current values of its parameters. If the model has not yet been trained, this prediction is likely to be off from the expected value for that input. The degree to which a prediction is incorrect can be calculated using a loss function, as shown in the following snippet. There are many different types of loss. This example uses a mean squared error loss. The loss function is instantiated only once, at the beginning of the training process.

loss_func = torch.nn.MSELoss()

In each iteration of the training loop, the predicted output given by the forward pass is compared to the target output by supplying both values to the previously defined loss function, which returns a numerical value that describes their degree of dissimilarity.

target = torch.randn(10, 10) # dummy target for example purposes with same shape as output
loss = loss_func(output, target)
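
To make the loss value less abstract, here is a small worked example with hand-picked numbers (not taken from the model above). It compares the result of MSELoss with the same quantity computed manually as the mean of the squared differences.

```python
import torch

loss_func = torch.nn.MSELoss()

prediction = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 5.0])

# Squared errors are 0, 0, and 4, so the mean squared error is 4 / 3
loss = loss_func(prediction, target)

# The same value computed by hand
manual = ((prediction - target) ** 2).mean()

print(loss.item(), manual.item())
```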

In the backward pass, this calculated loss is used to determine how the model parameters should be adjusted to move closer to the desired values. To do so, the network is traversed backwards and the gradient of the loss with respect to each model parameter is calculated. This is a very complex and often computationally intense operation that is handled by PyTorch's autograd differentiation engine, requiring only a single-line call to the backward() method. This call triggers autograd to calculate the model parameter gradients and propagate them backwards layer by layer. The calculated values are stored in each parameter tensor's grad attribute.

loss.backward()

Without autograd, the model developer would have to calculate and store a series of partial derivatives for every operation that occurred on every parameter tensor in all the network's layers. This would be a huge undertaking considering that many foundation models are composed of several billion parameters. Autograd is able to complete this process quickly because it keeps a record of the tensors and the operations performed on them in the form of a computational graph. Each intermediate tensor stores in its grad_fn metadata attribute the operation that occurred on it in the forward pass, linking it to the previous layer.

In the backward pass, autograd uses this graph to traverse the network in reverse, using the stored grad_fn to look up the correct partial derivative formula for that operation and finally calculating that gradient. This graph is constructed anew during every forward pass, meaning the graph is dynamic. This dynamic graph is a distinguishing feature of PyTorch that provides a great deal of flexibility, as model architecture can be changed during training without restarting the training process.
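
The graph and the grad_fn attribute can be inspected directly. The sketch below uses a tiny hand-made tensor in place of real model parameters (the names and values are illustrative assumptions) so that the gradient autograd produces can be checked against the derivative worked out by hand.

```python
import torch

# A leaf tensor that autograd is asked to track
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: each intermediate tensor records the operation that created it
y = w * 2
loss = (y ** 2).sum()

print(y.grad_fn)     # node for the multiplication
print(loss.grad_fn)  # node for the sum

# Backward pass: autograd walks the recorded graph in reverse
loss.backward()

# loss = sum((2w)^2) = 4 * sum(w^2), so d(loss)/dw = 8w
print(w.grad)  # tensor([ 8., 16., 24.])
```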

The backward pass is also frequently referred to as backpropagation.

Optimizing model parameters

The final step for this initial model training exercise is to apply these stored gradients to the model parameters. This step is done by an optimizer. A common choice for the optimizer is the stochastic gradient descent optimizer, but there are many other optimizers to choose from. An optimizer is initialized once at the beginning of the training process, registering the model's parameters that need to be trained and passing in any necessary parameters.

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

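Then, in each iteration of the training loop, the step() method is called to apply the gradients calculated by autograd to the model parameters, and a call to zero_grad() is made to reset the stored gradients for the next iteration. The reset is needed because PyTorch accumulates gradients across backward() calls by default.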

optimizer.step() # gradient descent
optimizer.zero_grad()
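
The arithmetic behind a single step() call can be followed on a one-element example. This sketch assumes plain stochastic gradient descent without momentum (unlike the optimizer configured above) so that the update reduces to parameter minus learning rate times gradient; the parameter value and learning rate are arbitrary.

```python
import torch

# A single trainable parameter, standing in for a model's parameters
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)  # plain SGD, no momentum

# loss = w^2, so the gradient d(loss)/dw is 2w = 2.0
loss = (w ** 2).sum()
loss.backward()
grad_before_step = w.grad.clone()
print(grad_before_step)

# Update: w <- w - lr * grad = 1.0 - 0.1 * 2.0 = 0.8
optimizer.step()
print(w.detach())

# Reset the stored gradient so it does not accumulate into the next iteration
optimizer.zero_grad()
```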

At this point, one iteration of the model training process is complete. As mentioned earlier, this process is repeated many times, each time nudging the model's parameters, and hence its accuracy, towards the optimal state. At any time during training, the model can be saved in case of failure or to keep a record of the model at certain checkpoints.
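
Putting the pieces together, a complete training loop might be sketched as follows. The stand-in model, the random data, the number of iterations, and the checkpoint file name are all assumptions made for illustration; the loop reuses one fixed batch, whereas real training would draw batches from a DataLoader. The final lines show one common way to save and restore a checkpoint through the model's state_dict.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random data and initial weights repeatable

# Stand-in model and data; the sizes mirror the earlier example
model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10))
inputs = torch.randn(10, 100)
targets = torch.randn(10, 10)

loss_func = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

losses = []
for step in range(50):
    output = model(inputs)             # forward pass
    loss = loss_func(output, targets)  # measure the error
    loss.backward()                    # backward pass: compute gradients
    optimizer.step()                   # apply gradients to the parameters
    optimizer.zero_grad()              # reset gradients for the next iteration
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")

# Save a checkpoint of the learned parameters (hypothetical file name)
checkpoint_path = os.path.join(tempfile.gettempdir(), "example_checkpoint.pt")
torch.save(model.state_dict(), checkpoint_path)

# Restore the parameters into a fresh copy of the same architecture
restored = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10))
restored.load_state_dict(torch.load(checkpoint_path))
```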

Model evaluation

After completing a training loop, an evaluation loop is run to assess the accuracy of the model. To evaluate a model, novel data is fed through the network in order to receive a prediction, and the prediction is compared to the expected result using the same loss metrics as those used during the training loop. The model parameters remain unchanged during evaluation, meaning that no backward pass occurs. PyTorch's torch.no_grad() context manager is used in this case to indicate to autograd that gradients should not be calculated, reducing unnecessary use of system resources. (Evaluation might also be referred to as testing.)
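
An evaluation pass along these lines could look like the sketch below. The model and the held-out data are random stand-ins, so the printed loss is not meaningful in itself. The call to model.eval() is an addition not discussed above: it switches layers such as dropout and batch normalization to their evaluation behavior and has no effect on this particular model.

```python
import torch
import torch.nn as nn

# Stand-in for a trained model, plus hypothetical held-out test data
model = nn.Sequential(nn.Linear(100, 200), nn.ReLU(), nn.Linear(200, 10))
loss_func = nn.MSELoss()
test_inputs = torch.randn(32, 100)
test_targets = torch.randn(32, 10)

model.eval()  # evaluation behavior for layers that distinguish train and eval

# No graph is recorded inside this block, so no gradients can be calculated
with torch.no_grad():
    predictions = model(test_inputs)
    test_loss = loss_func(predictions, test_targets)

print(f"test loss: {test_loss.item():.4f}")
print(predictions.requires_grad)  # False
```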

Once a model has been sufficiently trained and evaluated with the test data, it is ready for inference.

Model inference

AI inferencing is the process by which a model makes predictions on novel data and is often the end goal of training a model. Model inference is the stage at which the model is ready to be used. Model inferencing is used frequently in support of many daily tasks, including interacting with a chatbot, performing an image search, or using a speech-to-text function on a cellphone.

Model inferencing works in a similar way to the forward pass of the training phase: data is fed through the model in order to produce a prediction. In the inference phase, autograd is disabled and model parameters are not updated.
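
For a single new sample, inference might be sketched as shown here. The untrained stand-in model and the random input are placeholders, and torch.inference_mode() is used as one way to disable autograd; torch.no_grad(), mentioned earlier, would also work. Because the final layer is a softmax, the output can be read as a probability for each of the ten outputs, and the index of the largest value is taken as the prediction.

```python
import torch
import torch.nn as nn

# Stand-in for a trained model with a softmax output layer
model = nn.Sequential(
    nn.Linear(100, 200),
    nn.ReLU(),
    nn.Linear(200, 10),
    nn.Softmax(dim=-1),
)
model.eval()

new_sample = torch.randn(1, 100)  # one previously unseen input

# Autograd is disabled; the parameters are only read, never updated
with torch.inference_mode():
    probabilities = model(new_sample)

predicted_class = probabilities.argmax(dim=-1).item()
print(predicted_class)
```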

Summary and next steps

Although we've only scratched the surface of the capabilities of PyTorch, it is clear that it is a robust and comprehensive framework that supports many phases of the machine learning workflow. And with a large, active community of open source developers dedicated to its continual improvement, it is no wonder that PyTorch is a leading deep learning framework and one of the components of the watsonx.ai hybrid cloud-native stack.

The watsonx platform is built for enterprise AI, bringing foundation models to enterprises with trust and governance. Check out watsonx.ai, the next-generation studio for AI builders. Explore more articles and tutorials about watsonx.