In this article, I briefly described what MLflow is and how it works. MLflow currently provides APIs in Python that you can invoke in your machine learning source code to log parameters, metrics, and artifacts to be tracked by the MLflow tracking server.

If you do your machine learning work in R, you might want to track your models and every run with MLflow. There are several approaches that you can use:

  • Wait for MLflow to release the APIs in R
  • Wrap MLflow RESTful APIs and log through curl commands
  • Call existing Python APIs with some R packages that can invoke the Python interpreter

The last approach is simple and easy, while allowing you to interact with MLflow without waiting for R APIs to be available. In this tutorial, I explain how to do this with the reticulate R package.

reticulate is an open source R package that lets you call Python from R by embedding a Python session within the R session. The package provides seamless and high-performance interoperability between R and Python. The package is available in the CRAN repository.

MLflow also comes with a Projects component that packs data, source code with commands, parameters, and the execution environment setup together as a self-contained specification. After an MLproject is defined, you can run it anywhere. Currently, an MLproject can run Python code or a shell command. It can also set up the Python environment for the project as specified in a user-defined conda.yaml file.

For R users, it’s common to load some packages in the R source code. These packages must be installed for the R code to run. In the future, it could be a good enhancement for MLflow to add something similar to conda.yaml to set up R package dependencies. This tutorial explains how to create an MLproject containing R source code and run it with the mlflow run command.

Learning objectives

In this tutorial, you will install and set up the MLflow environment, train and track machine learning models in R, package source code and data in an MLproject, and run it with the mlflow run command.

Prerequisites

Before beginning this tutorial, make sure that Python is installed on the platform where R is running. I prefer installing Miniconda. Because the machine learning training is done in R, R must already be installed on the platform as well.

Estimated time

Completing this tutorial should take approximately 30 minutes.

Steps

Step 1: Install MLflow

Create a conda environment for MLflow and install the mlflow package as follows:

conda create -q -n mlflow python=3.6
source activate mlflow
pip install -U pip
pip install mlflow

Step 2: Install reticulate R package

Install the reticulate package through R.

install.packages("reticulate")

reticulate allows R to call Python functions seamlessly. You load a Python package with the import function and call its functions through the $ operator.

> library(reticulate)
> path <- import("os.path")
> path$isdir("/tmp")
[1] TRUE

As you can see, it’s simple to call Python functions in the os.path module from R with this package. You can do the same thing with the mlflow package by importing it and then calling mlflow$log_param and mlflow$log_metric to log parameters and metrics for the R script.
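As a minimal sketch of that idea, the following snippet logs one parameter and one metric from R. The names alpha and rmse and their values are made up for illustration, and the guard around the calls is an assumption added so that the snippet does nothing harmful when the mlflow Python module is not importable.

```r
library(reticulate)

# Record whether anything was logged, so the script can report it
logged <- FALSE

if (py_module_available("mlflow")) {
  mlflow <- import("mlflow")

  # Parameters and metrics are key-value pairs; these names are illustrative
  mlflow$log_param("alpha", 0.5)
  mlflow$log_metric("rmse", 0.79)

  # Close the run that the first log call started implicitly
  mlflow$end_run()
  logged <- TRUE
} else {
  message("mlflow Python module not found; skipping the logging calls")
}
```

When no tracking server is configured, MLflow writes these values to a local mlruns directory under the current working directory.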

Step 3: Train a GLM model with SparkR

The following R script builds a linear regression model with SparkR. You must have the SparkR package installed for this example.

# load the reticulate package and import mlflow Python module
library(reticulate)
mlflow <- import("mlflow")

# load SparkR package and start spark session
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master="local[*]")

# convert iris data.frame to SparkDataFrame
df <- as.DataFrame(iris)

# parameter for GLM
family <- c("gaussian")

# log the parameter
mlflow$log_param("family", family)

# fit the GLM model
model <- spark.glm(df, Species ~ ., family = family)

# examine the model
summary(model)

# path to save the model
model_path <- "/tmp/mlflow-GLM"

# save the model
write.ml(model, model_path)

# log the artifact
mlflow$log_artifacts(model_path)

# stop spark session
sparkR.session.stop()

You can either copy the script to R or RStudio and run it interactively, or save it to a file and run it with the Rscript command. Make sure that the PATH environment variable includes the path to the Python environment where mlflow is installed.
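If adjusting PATH is inconvenient, reticulate can also be pointed at the environment from inside R. The sketch below assumes the conda environment from Step 1 is named mlflow; the tryCatch wrapper is an addition for illustration so that a missing environment produces a message instead of stopping the script.

```r
library(reticulate)

# Assumption: the conda environment created in Step 1 is named "mlflow"
env_name <- "mlflow"

bound <- tryCatch({
  # Ask reticulate to use this environment and fail if it cannot
  use_condaenv(env_name, required = TRUE)

  # Show which Python interpreter reticulate actually bound to
  print(py_config())
  TRUE
}, error = function(e) {
  message("Could not bind to conda environment '", env_name, "': ",
          conditionMessage(e))
  FALSE
})

# Import mlflow only when the binding worked and the module is present
if (bound && py_module_available("mlflow")) {
  mlflow <- import("mlflow")
}
```

The call to use_condaenv has to come before the first import, because reticulate binds to a single Python interpreter per R session.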

Step 4: Launch the MLflow UI

Launch the MLflow UI by running the mlflow ui command from a shell. Then, open a browser and go to http://127.0.0.1:5000. Your previous GLM model training run is now listed and can be tracked. The following image shows a snapshot of this.

*MLflow* UI snapshot

Step 5: Train a decision tree model

  1. Download the wine-quality.csv data set to your platform.

  2. Install the rpart package in your R environment:

     install.packages("rpart")
    
  3. Follow this example rpart-example.R to fit a tree model:

     # Source prep.R file to install the dependencies
     source("prep.R")
    
     # Import mlflow python package for tracking
     library(reticulate)
     mlflow <- import("mlflow")
    
     # Load rpart to build a tree model
     library(rpart)
    
     # Read in data
     wine <- read.csv("wine-quality.csv")
    
     # Build the model
     fit <- rpart(quality ~ ., wine)
    
     # Save the model that can be loaded later
     saveRDS(fit, "fit.rpart")
    
     # Save the model to mlflow tracking server
     mlflow$log_artifact("fit.rpart")
    
     # Plot
     jpeg("rplot.jpg")
     par(xpd=TRUE)
     plot(fit)
     text(fit, use.n=TRUE)
     dev.off()
    
     # Save the plot to mlflow tracking server
     mlflow$log_artifact("rplot.jpg")
    

The R code has three parts: installing the R package dependencies, training the model, and logging the artifacts through MLflow.
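The script logs artifacts only. To also record how well the tree fits, a metric can be computed in R and logged the same way. The following sketch is not part of rpart-example.R: it uses the built-in mtcars data as a stand-in for wine-quality.csv, and the metric name rmse is an illustrative choice.

```r
library(rpart)

# Stand-in data: mtcars ships with R; the tutorial uses wine-quality.csv
fit <- rpart(mpg ~ ., data = mtcars)

# Root mean squared error on the training data
pred <- predict(fit, mtcars)
rmse <- sqrt(mean((mtcars$mpg - pred)^2))

# Round-trip the model through saveRDS/readRDS, as a later session would
model_file <- file.path(tempdir(), "fit.rpart")
saveRDS(fit, model_file)
restored <- readRDS(model_file)
stopifnot(isTRUE(all.equal(predict(restored, mtcars), pred)))

# Log the metric and the model only when reticulate and mlflow are available
if (requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_module_available("mlflow")) {
  mlflow <- reticulate::import("mlflow")
  mlflow$log_metric("rmse", rmse)
  mlflow$log_artifact(model_file)
}
```

A model saved with saveRDS can be restored in any later R session with readRDS, which is what makes the logged fit.rpart artifact reusable.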

Step 6: Prepare package dependencies for MLproject

In the previous example, the reticulate and rpart R packages are required for the code to run. To pack this code into a self-contained project, a script should run first and automatically install any of these packages that the platform does not already have.

Any specific R package needed for the project is installed through prep.R with this code:

# Accept parameters, args[6] is the R package repo url
args <- commandArgs()

# All installed packages
pkgs <- installed.packages()

# List of required packages for this project
reqs <- c("reticulate", "rpart")

# Try to install the dependencies if not installed
sapply(reqs, function(x){
  if (!x %in% rownames(pkgs)) {
    install.packages(x, repos=c(args[6]))
  }
})
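The index 6 works because of how Rscript invokes R: commandArgs() returns the interpreter path and its own options first, and the URL typed after the script name lands in the sixth position. A less position-dependent variant, sketched here as an alternative rather than as the code the project uses, reads only the trailing arguments. The helper names repo_from_args and install_missing are hypothetical.

```r
# Hypothetical helper: pick the repository URL from the script arguments,
# falling back to a default when none is given
repo_from_args <- function(args, default = "https://cran.r-project.org/") {
  if (length(args) >= 1 && nzchar(args[1])) args[1] else default
}

# Hypothetical helper: install only the packages that are not present yet
install_missing <- function(reqs, repo) {
  missing_pkgs <- setdiff(reqs, rownames(installed.packages()))
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs, repos = repo)
  }
  invisible(missing_pkgs)
}

# trailingOnly = TRUE drops the interpreter's own options, so the first
# element is the first argument typed after the script name
repo <- repo_from_args(commandArgs(trailingOnly = TRUE))
message("Packages would be installed from: ", repo)
```

In prep.R, the last step would then be a call such as install_missing(c("reticulate", "rpart"), repo).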

Step 7: Test your code

Before packaging these files into an MLproject, test the code by invoking the Rscript command directly:

Rscript rpart-example.R https://cran.r-project.org/

From the MLflow UI, you should see that this run has been tracked, as in this image:

snapshot

Step 8: Create an MLproject

Now, let’s write the spec and package this project as an MLproject that MLflow knows how to run. All you need to do is create the MLproject file in the same directory.

name: r_example

entry_points:
    main:
        parameters:
            r-repo: {type: string, default: "https://cran.r-project.org/"}
        command: "Rscript rpart-example.R {r-repo}"

This file defines an r_example project with a main entry point. The entry point specifies the command and parameters to be executed by the mlflow run command. For this project, Rscript is the shell command that invokes the R source code. The r-repo parameter provides the URL of the repository from which the dependent packages are installed, and it has a default value. This parameter is passed to the command that runs the R source code.
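The default can be overridden at run time with the -P option of mlflow run. The sketch below drives the command line from R and assumes it is started in the directory that contains the MLproject file; the mirror URL is just an illustrative alternative to the default. When the mlflow command or the MLproject file is missing, it only prints the command it would have run.

```r
# Alternative CRAN mirror, chosen for illustration
repo <- "https://cloud.r-project.org/"

# Arguments for: mlflow run . -P r-repo=<url>
run_args <- c("run", ".", "-P", paste0("r-repo=", repo))

# Only launch when the mlflow CLI is on the PATH and an MLproject file
# exists in the working directory
if (nzchar(Sys.which("mlflow")) && file.exists("MLproject")) {
  status <- system2("mlflow", run_args)
} else {
  message("Would run: mlflow ", paste(run_args, collapse = " "))
}
```

The value given after -P replaces the {r-repo} placeholder in the entry point's command.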

Now that you have all the files required to train this tree model, you can create an MLproject by creating a directory and copying the data and R source code to that directory.

.
└── R
    ├── MLproject
    ├── prep.R
    ├── rpart-example.R
    └── wine-quality.csv

Step 9: Check in and test the MLproject

The previous MLproject can be checked in and pushed to a GitHub repository. Use the following command to test the project. It can be run on any platform that has R installed.

mlflow run https://github.com/adrian555/DocsDump#files/mlflow-projects/R

The project can also be viewed from the MLflow tracking UI, as in this image:

snapshot-project

The differences between this view and the previous run without the MLproject spec are the Run Command, which captures the exact command used to run the project, and the Parameters, which automatically log any parameters passed to entry points.

Summary

In this tutorial, you have successfully created an MLproject in R, and tracked and run it with MLflow. This approach lets R users take advantage of the MLflow Tracking component so that you can track your R models quickly. It also demonstrated what the Projects component of MLflow is designed for: to define the project and make it easy to rerun. R users can quickly set up their projects and enjoy the ease of tracking and running projects with MLflow.