Tutorial
Run a distributed model training workload using Ray and CodeFlare
Create a Ray cluster using the CodeFlare SDK and interact with Ray within the CodeFlare Stack to run distributed ML workloads
In today's rapid ML development lifecycle, the ability to process and analyze vast amounts of data quickly and efficiently is paramount. You might develop a lightweight workload on your laptop and then struggle when you need to scale it out, because tackling large-scale computation with distributed systems usually comes with complexity.
Ray is an open source, unified framework for running distributed Python workloads. You can start by developing your Python code locally on your laptop, taking full advantage of its resources. When your workload demands more computational power, Ray seamlessly extends your capabilities by leveraging a remote Ray cluster. Explore more about using Ray in the Getting Started doc.
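If you haven't used Ray before, the following minimal sketch (not part of this tutorial's training script; the function name is just for illustration) shows the programming model: decorate a Python function with @ray.remote, and Ray schedules calls to it as parallel tasks, either locally or on a remote cluster.
# minimal_ray_task.py -- a minimal sketch, separate from the training example below
import ray

# Starts a local Ray instance; pass an address such as ray://<head-svc>:10001
# to connect to a remote cluster instead.
ray.init()

@ray.remote
def square(x):
    # Runs as a Ray task, potentially in another process or on another node.
    return x * x

# Launch 4 tasks in parallel and gather their results.
print(ray.get([square.remote(i) for i in range(4)]))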
Built on top of Ray, CodeFlare is designed to simplify the integration, scaling, and acceleration of large-scale workloads on Red Hat OpenShift in the cloud. CodeFlare provides a user-friendly interface that abstracts away the complexities of infrastructure management, making cluster resources accessible to both novice and experienced data scientists.
In this tutorial, you'll learn how to create a Ray cluster using the CodeFlare SDK and interact with Ray within the CodeFlare Stack.
Prerequisites
- CodeFlare on a Red Hat OpenShift cluster.
- Python runtime environment. See the instructions in the Pyenv repo.
Steps
- Step 1: Run the training example locally.
- Step 2: Create a Ray cluster using the CodeFlare SDK.
- Step 3: Interact with the cluster.
Run the training example locally
Let's use the training script from the Ray documentation. The script is an example of distributed training using the Ray Train Python library.
# save as ray_torch_quickstart.py
import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

# If using GPUs, set this to True.
use_gpu = False

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
Next, verify that you can run this example locally. First, install the Ray packages by using the following commands in a terminal.
mkdir example; cd example
pyenv install 3.8.13
pyenv virtualenv 3.8.13 localinter
pyenv local localinter
pip install "ray[default]" torch pandas pyarrow
# Assume the above script is saved as ray_torch_quickstart.py
python ray_torch_quickstart.py
The example automatically starts a local instance of Ray and runs the training job. When the job finishes, it should return a status message similar to the following:
Trial TorchTrainer_ec4d0_00000 completed.
== Status ==
Current time: 2023-06-20 17:02:45 (running for 00:00:11.98)
Using FIFO scheduling algorithm.
Logical resource usage: 4.0/8 CPUs, 0/0 GPUs
Result logdir: /Users/tedchang/ray_results/TorchTrainer_2023-06-20_17-02-33
Create a Ray cluster using the CodeFlare SDK
If you do not already have a Ray cluster running in your CodeFlare environment, you can create one using the CodeFlare SDK (codeflare-sdk). You must have already logged in to your cluster with oc login.
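For example (the token and server URL below are placeholders for your own cluster's values):
oc login --token=<your-token> --server=https://<your-openshift-api-url>:6443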
# This creates a Ray cluster with 1 head node and up to 2 worker nodes in the CodeFlare environment.
pip install codeflare-sdk
python - << EOF
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
cluster = Cluster(ClusterConfiguration(
    namespace="default",
    name="torch",
    min_worker=1,
    max_worker=2,
    min_cpus=2,
    max_cpus=4,
    min_memory=4,
    max_memory=8,
    gpu=0,
    instascale=False,
))
cluster.up()
cluster.wait_ready()
cluster.details()
EOF
The output should include the remote Ray dashboard URL.
Requested cluster up and running!
🚀 CodeFlare Cluster Details 🚀
╭──────────────────────────────────────────────────────────────╮
│ Name │
│ torch Active ✅ │
│ │
│ URI: ray://torch-head-svc.default.svc:10001 │
│ │
│ Dashboard🔗 │
│ │
│ Cluster Resources │
│ ╭─ Workers ──╮ ╭───────── Worker specs(each) ─────────╮ │
│ │ Min Max │ │ Memory CPU GPU │ │
│ │ │ │ │ │
│ │ 1 2 │ │ 4~8 2 0 │ │
│ │ │ │ │ │
│ ╰────────────╯ ╰──────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────╯
Alternatively, you can get the remote Ray dashboard URL from the CodeFlare cluster this way:
oc get route -n default|grep ray-dashboard-torch|awk '{print $2}'
# Take note of the Ray dashboard URL. It should look something like:
# ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com
Interact with the remote cluster
The Ray Jobs CLI lets you interact with the remote Ray cluster. For example, you can submit the training script as a job to the Ray cluster via the Ray Dashboard URL.
export RAY_ADDRESS=http://`oc get route -n default|grep ray-dashboard-torch|awk '{print $2}'`
# The ray job command-line tool comes with the Ray Python package that we installed earlier.
ray job submit --working-dir . --no-wait -- python ray_torch_quickstart.py
You can expect output similar to the following:
-------------------------------------------------------
Job 'raysubmit_ebUGqdNDVAcvFZ8U' submitted successfully
-------------------------------------------------------
Next steps
  Query the logs of the job:
    ray job logs raysubmit_3u1wFAuV2x4Lx2eT
  Query the status of the job:
    ray job status raysubmit_3u1wFAuV2x4Lx2eT
  Request the job to be stopped:
    ray job stop raysubmit_3u1wFAuV2x4Lx2eT
You can interact with Ray using the commands presented in the output. For example, ray job status raysubmit_3u1wFAuV2x4Lx2eT should eventually return something similar to the following:
Job submission server address: http://ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com
------------------------------------------
Job 'raysubmit_3u1wFAuV2x4Lx2eT' succeeded
------------------------------------------
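If you prefer to drive job submission from Python instead of the CLI, Ray also provides a job submission client. The following is a minimal sketch that reuses the dashboard URL and script name from above; the job ID it returns will differ from the examples shown here.
# submit_job.py -- a minimal sketch using Ray's job submission API
from ray.job_submission import JobSubmissionClient

# Point the client at the same Ray dashboard URL that you exported as RAY_ADDRESS.
client = JobSubmissionClient("http://ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com")

# Ship the current directory to the cluster and run the training script there.
job_id = client.submit_job(
    entrypoint="python ray_torch_quickstart.py",
    runtime_env={"working_dir": "."},
)
print(job_id)
print(client.get_job_status(job_id))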
You can review more CodeFlare examples in the codeflare-sdk demo notebooks.
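When you are finished experimenting, you can tear down the Ray cluster with the CodeFlare SDK as well. Here is a minimal sketch that assumes the same name and namespace used earlier and relies on the SDK's default configuration values:
python - << EOF
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Rebuild the cluster object by name and namespace, then delete its resources.
cluster = Cluster(ClusterConfiguration(namespace="default", name="torch"))
cluster.down()
EOF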
Summary and next steps
This tutorial showed you how to run a distributed model training workload with the open source solutions Ray and CodeFlare. Model training is a key part of the machine learning lifecycle.
If you require a robust platform to train your ML workflows, you should check out watsonx.ai. Watsonx.ai brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune, and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data.
Try watsonx.ai - next-generation studio for AI builders. Explore more articles and tutorials about watsonx on IBM Developer.