In today's rapid ML development lifecycle, the ability to process and analyze vast amounts of data quickly and efficiently is paramount. People might typically develop light workloads on their laptop and then struggle when they need to scale out the workloads. Tackling large-scale computation with distributed systems usually comes with complexity.
Ray is an open source and unified framework for running distributed Python workloads. You can start by developing your Python code locally on your laptop, taking full advantage of its resources. But when your workload demands more computational power, Ray seamlessly extends your capabilities by leveraging a remote Ray cluster. Explore more about using Ray in the Getting Started doc.
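To give a feel for the programming model, here is a minimal sketch (not part of the tutorial's training example) of a Ray task that runs the same way locally and against a remote cluster. ray.init, the @ray.remote decorator, and ray.get are core Ray APIs; the cluster address shown in the comment is only a placeholder.

import ray

# With no address, Ray starts a local instance on your laptop.
# To use a remote cluster instead, pass its address, for example:
# ray.init(address="ray://<head-node-host>:10001")
ray.init()

@ray.remote
def square(x):
    # Runs as a task on whichever node Ray schedules it on.
    return x * x

# Launch tasks in parallel and collect the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]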
Built on top of Ray, CodeFlare is designed to simplify the integration, scaling, and acceleration of large-scale workloads on Red Hat OpenShift in the cloud. CodeFlare provides a user-friendly interface that abstracts away the complexities of infrastructure management, making cluster resources accessible to both novice and experienced data scientists.
In this tutorial, you'll learn how to create a Ray cluster using the CodeFlare SDK and interact with Ray within the CodeFlare Stack.
Prerequisites
A Python runtime environment managed with pyenv. See the instructions in the Pyenv repo.
Steps
Step 1: Run the training example locally.
Step 2: Create a Ray cluster using the CodeFlare SDK.
Step 3: Interact with the cluster.
Run the training example locally
Let's use the training script from the Ray documentation. The script is an example of distributed training using the Ray Train Python library.
# save as ray_torch_quickstart.py
import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

# If using GPUs, set this to True.
use_gpu = False

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
Next, we need to verify that we can run this example locally. First, set up a Python environment and install Ray and the other required packages using the following commands in a terminal.
mkdir example; cd example
pyenv install 3.8.13
pyenv virtualenv 3.8.13 localinter
pyenv local localinter
pip install "ray[default]" torch pandas pyarrow
# Assume the above script is saved as ray_torch_quickstart.py
python ray_torch_quickstart.py
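Before running the full script, you can optionally confirm that Ray starts correctly in the new environment. This quick check only uses ray.init() and ray.cluster_resources(), both standard Ray APIs:

python -c 'import ray; ray.init(); print(ray.cluster_resources())'
# Prints the CPUs, memory, and (if any) GPUs that Ray detects on your laptop.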
The example automatically starts a local instance of Ray and runs the training job. When the job finishes, it should return a status message similar to the following:
Trial TorchTrainer_ec4d0_00000 completed.
== Status ==
Current time: 2023-06-20 17:02:45 (running for 00:00:11.98)
Using FIFO scheduling algorithm.
Logical resource usage: 4.0/8 CPUs, 0/0 GPUs
Result logdir: /Users/tedchang/ray_results/TorchTrainer_2023-06-20_17-02-33
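If you want to inspect what the training run produced, you can extend the end of ray_torch_quickstart.py with a few lines. This is an optional sketch; result here is the object returned by trainer.fit() in the script above, and its metrics and checkpoint attributes come from the Ray AIR Result API.

# Optional additions to the end of ray_torch_quickstart.py
print(result.metrics)                # metrics Ray recorded for the run
state = result.checkpoint.to_dict()  # the dict checkpoint saved via session.report()
print(state["epoch"])                # epoch number stored in the final checkpoint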
Create a Ray cluster using the CodeFlare SDK
If you do not already have a Ray cluster running on your CodeFlare stack, you can create one using the CodeFlare SDK (codeflare-sdk).
You must have already logged in to your OpenShift cluster with oc login.
# This creates a Ray cluster with 1 head node and up to 2 worker nodes on the CodeFlare stack.
pip install codeflare-sdk
python - << EOF
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
cluster = Cluster(ClusterConfiguration(
    namespace="default",
    name="torch",
    min_worker=1,
    max_worker=2,
    min_cpus=2,
    max_cpus=4,
    min_memory=4,
    max_memory=8,
    gpu=0,
    instascale=False,
))
cluster.up()
cluster.wait_ready()
cluster.details()
EOF
The output should include the remote Ray dashboard URL.
Alternatively, you can get the remote Ray dashboard URL from the CodeFlare cluster this way:
oc get route -n default | grep ray-dashboard-torch | awk '{print $2}'
# Take note of the Ray dashboard URL. It should look something like:
# ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com
Interact with the remote cluster
The Ray Jobs CLI lets you interact with the remote Ray cluster. For example, you can submit the training script as a job to the Ray cluster via the Ray Dashboard URL.
export RAY_ADDRESS=http://`oc get route -n default | grep ray-dashboard-torch | awk '{print $2}'`
# The ray job command-line tool comes with the Ray Python package, which we installed earlier.
ray job submit --working-dir . --no-wait -- python ray_torch_quickstart.py
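If you prefer to stay in Python rather than the command line, the Ray Jobs Python SDK can submit the same script. This is a minimal sketch; JobSubmissionClient and submit_job are part of ray.job_submission, and the address below assumes the dashboard route you noted in the previous step.

from ray.job_submission import JobSubmissionClient

# Point the client at the Ray dashboard route from the previous step.
client = JobSubmissionClient("http://ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com")

job_id = client.submit_job(
    entrypoint="python ray_torch_quickstart.py",
    runtime_env={"working_dir": "./"},  # ships the local script to the cluster
)
print(job_id)
print(client.get_job_status(job_id))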
The ray job submit command returns output similar to the following:
-------------------------------------------------------
Job 'raysubmit_ebUGqdNDVAcvFZ8U' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job: ray job logs raysubmit_3u1wFAuV2x4Lx2eT
Query the status of the job: ray job status raysubmit_3u1wFAuV2x4Lx2eT
Request the job to be stopped: ray job stop raysubmit_3u1wFAuV2x4Lx2eT
You can interact with Ray using the commands presented in the output. For example, ray job status raysubmit_3u1wFAuV2x4Lx2eT should eventually return something similar to the following:
Job submission server address: http://ray-dashboard-torch-default.apps.tedbig412.cp.fyre.ibm.com
------------------------------------------
Job 'raysubmit_3u1wFAuV2x4Lx2eT' succeeded
------------------------------------------
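When you no longer need the remote Ray cluster, you can delete it with the SDK as well. The sketch below assumes the same name and namespace used when the cluster was created, and that recreating the handle with only those fields is enough for the SDK to locate the cluster; cluster.down() is the codeflare-sdk counterpart to cluster.up().

python - << EOF
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
# Recreate the handle with the same name and namespace, then tear the cluster down.
cluster = Cluster(ClusterConfiguration(namespace="default", name="torch"))
cluster.down()
EOF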
This tutorial showed you how to run a distributed model training workload with the open source solutions Ray and CodeFlare. Model training is a key part of the machine learning lifecycle.
If you require a robust platform to train your ML workflows, you should check out watsonx.ai. Watsonx.ai brings together new generative AI capabilities, powered by foundation models, and traditional machine learning into a powerful platform spanning the AI lifecycle. With watsonx.ai, you can train, validate, tune, and deploy models with ease and build AI applications in a fraction of the time with a fraction of the data.