KServe is a powerful tool in the realm of model serving, particularly known for its integration with Kubernetes. It facilitates the deployment and scaling of machine learning models, allowing them to be efficiently served in production environments. KServe's compatibility with Kubernetes ensures robust scalability and management, making it a go-to choice for many machine learning engineers.
After you have a foundational understanding of model serving with KServe, the next step is to explore what Ray Serve offers and how it integrates with KServe for parallel inferencing. Ray Serve is a scalable and programmable serving framework built atop Ray. It is framework agnostic, so it can serve models built with any machine learning framework, such as PyTorch, TensorFlow, or scikit-learn. Moreover, Ray Serve enables the construction of complex inference services that combine multiple machine learning models and business logic, all orchestrated in Python code.
Ray Serve Architecture
Delving deeper, Ray Serve's architecture is intrinsically linked to Ray. It operates within a Ray cluster consisting of a head node and several worker nodes. In this setup, Ray Serve runs as an application on the Ray cluster and introduces the following additional Ray actors:
HTTP proxy actor: Manages incoming requests and forwards them to replicas.
Replica: Each replica processes individual requests from the HTTP proxy and responds once they are completed.
Controller actor: Oversees the management of other actors.
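To make these roles concrete, the following minimal sketch shows a standalone Ray Serve deployment (independent of KServe; the Echo class, application name, and replica count are illustrative, not from the original article). The controller actor creates the requested replicas, and the HTTP proxy actor routes each request to one of them:

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)  # the controller actor creates two replica actors
class Echo:
    async def __call__(self, request: Request) -> dict:
        # The HTTP proxy actor forwards each incoming request to one of the replicas,
        # which processes it and returns the response.
        body = await request.json()
        return {"echo": body}


# Deploy the application onto the Ray cluster; this starts the proxy and controller.
serve.run(Echo.bind(), name="echo-app")
```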
Parallel Inferencing with KServe and Ray Serve
By default, KServe's custom model serving runtime handles model loading and prediction in the same process as the HTTP server. Integrating Ray Serve changes this paradigm: when you enable Ray Serve, KServe launches a Ray Serve instance, which significantly changes how requests are handled:
Models are deployed to Ray Serve as replicas, allowing for parallel inferencing when serving multiple requests.
Implementing Ray Serve with KServe
To enable Ray Serve on KServe, the process involves a few straightforward steps.
Create a KServe custom model
First, we need to create a custom serving runtime with two handler methods using the KServe API. The kserve.Model base class requires at least a load handler and a predict handler to implement a custom model serving runtime.
def load() is the load handler for loading a model into memory. It is generally called from __init__().
def predict() is the prediction handler that implements the inference logic and returns a result.
The following Python example shows an implementation of each handler for serving a custom model.
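This is a minimal sketch of such a runtime; the trivial in-memory "model" that doubles its inputs and the model name custom-model are placeholders for a real framework model and name, and the exact handler signatures can vary slightly between KServe versions:

```python
from typing import Dict

from kserve import Model, ModelServer


class CustomModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        # Load the model artifact into memory. A trivial in-memory "model"
        # stands in here for a real model (PyTorch, TensorFlow, scikit-learn, etc.).
        self.model = lambda values: [v * 2 for v in values]
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Run inference on the request instances and return the predictions.
        instances = payload["instances"]
        return {"predictions": [self.model(instance) for instance in instances]}


if __name__ == "__main__":
    model = CustomModel("custom-model")
    ModelServer().start([model])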
You can save the example code in a file and run it as a Python program to serve the custom model for inferencing. The example is also available for download from this GitHub Gist.
The following code shows how to modify the original example to enable Ray Serve with KServe in three steps:
Import Ray Serve into your environment.
Define the Ray Serve application using a decorator and specify the number of replicas for scaling.
Start the KServe model server with a different parameter to incorporate Ray Serve.
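Here is a sketch of the modified program that follows those three steps. The replica count of 2 and the use of a dictionary mapping the model name to the Ray Serve deployment in ModelServer().start() are illustrative assumptions; check the KServe release you are using for the exact supported form:

```python
from typing import Dict

from kserve import Model, ModelServer
from ray import serve  # Step 1: import Ray Serve


# Step 2: declare the model as a Ray Serve deployment and set the replica count.
@serve.deployment(name="custom-model", num_replicas=2)
class CustomModel(Model):
    def __init__(self):
        self.name = "custom-model"
        super().__init__(self.name)
        self.model = None
        self.ready = False
        self.load()

    def load(self):
        # Same placeholder in-memory "model" as before; replace with real model loading.
        self.model = lambda values: [v * 2 for v in values]
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        instances = payload["instances"]
        return {"predictions": [self.model(instance) for instance in instances]}


if __name__ == "__main__":
    # Step 3: start the model server with a dict of {model name: deployment}
    # instead of a list of Model instances, so KServe deploys the model to
    # Ray Serve replicas for parallel inferencing.
    ModelServer().start({"custom-model": CustomModel})
```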
You can save the example in a file and run it as a Python program to serve the custom model with Ray Serve. The example is available for download from this GitHub Gist.
The integration of KServe with Ray Serve offers a robust solution for scalable model serving, which is particularly beneficial for environments requiring parallel inferencing. This combination leverages the strengths of both platforms, providing a flexible, scalable, and efficient approach to serving machine learning models in production.
We recently presented on this topic at Ray Summit 2023, which you can watch in the following video playback.