The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering, and model selection, while important, is just one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.
This is particularly challenging when deploying Apache Spark ML pipelines for low-latency scoring. While MLlib’s DataFrame API is powerful, elegant, and works well in batch scoring scenarios, it is relatively ill-suited to the needs of many real-time predictive applications, for two main reasons.
First, many real-time applications have low-latency requirements that cannot be met with the current DataFrame-based execution model, because Spark’s SQL engine introduces significant overhead, including query planning and job scheduling, even when running on a single node. Second, live scoring environments are typically separate from Spark clusters, and requiring the live environment to bring in all of Spark and its dependencies is overly complex and prone to introducing conflicts, errors, and version management issues.
Currently, for users to load trained models outside of Spark, they must either:
- Write a custom reader for Spark’s native format
- Create their own custom format
- Export to a standard format (not currently supported in Spark ML, hence requiring a custom-built export library)
Furthermore, to actually score Spark models outside of Spark, users are forced to either re-implement scoring algorithms or create a custom translation layer between Spark ML and another ML library. Taken together, these issues represent a major pain point for Spark ML users wishing to deploy their models to production.
Open standards for model deployment
Recently, a new standard for exporting and executing analytic applications has emerged: the Portable Format for Analytics (PFA). PFA is championed by the Data Mining Group (DMG) and is the more powerful and flexible successor to the relatively widely adopted Predictive Model Markup Language (PMML), which the DMG also created.
A PFA “document” is a JSON file that specifies the input and output data schema (using Apache Avro types), state, and the transformation (or set of functions) applied to the input to return the output. It can be thought of as a mini functional language together with a data schema specification.
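As a rough illustration of that structure, the sketch below builds a minimal PFA-style document as a Python dictionary and serializes it to JSON. The add-one transformation is a made-up example, not something produced by Spark or Aardpfark, and the Avro type `double` stands in for a real input and output schema.

```python
import json

# A minimal PFA-style document: read a double, add 1, return a double.
# The transformation itself is purely illustrative.
pfa_document = {
    "input": "double",           # Avro schema of each input value
    "output": "double",          # Avro schema of each result
    "action": [                  # expressions evaluated for every input
        {"+": ["input", 1.0]}    # apply the "+" function to the input and 1.0
    ],
}

# The whole "model" is plain JSON text, so it can be written to a file ...
serialized = json.dumps(pfa_document, indent=2)
print(serialized)

# ... and read back with nothing more than a JSON parser.
restored = json.loads(serialized)
```

A model that carries state, such as fitted coefficients, would add further top-level sections to the same document (PFA calls these cells and pools); they are left out here to keep the sketch short.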
Using an open standard for deploying models means the model producer (for example, Spark or another ML framework) and the model consumer (for example, the scoring environment) can be completely independent. In the case of Spark ML models, standardization means one scoring engine can be used for deploying models not only from various Spark environments and versions, but from other runtimes and frameworks (for example, scikit-learn, R, and others). Independence means model scoring is fast (because it does not execute via Spark DataFrames with associated higher latency) and less complex (because it does not depend on the Spark cluster runtime).
The appeal of PFA is that the format specifies both the serialization and the execution of the data transformation or machine learning model. That is, a PFA document is fully self-contained and can be executed by any compliant execution engine, making a model written to PFA truly portable across languages, frameworks, and runtimes.
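To make the idea of a self-contained, executable document more concrete, here is a toy sketch of the consumer side. It is emphatically not a compliant PFA engine: the `BUILTINS` table, `evaluate`, and `score` are hypothetical names invented for this illustration, and only two arithmetic functions are handled. The point is simply that the scorer needs nothing but the JSON text.

```python
import json

# Serialized document as a scoring service might receive it (illustrative only).
DOCUMENT = (
    '{"input": "double", "output": "double", '
    '"action": [{"*": [{"+": ["input", 1.0]}, 2.0]}]}'
)

# Hypothetical function table; a real engine implements the full PFA library.
BUILTINS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}


def evaluate(expr, value):
    """Recursively evaluate one expression against a single input value."""
    if expr == "input":
        return value
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, dict) and len(expr) == 1:
        (name, args), = expr.items()
        return BUILTINS[name](*(evaluate(arg, value) for arg in args))
    raise ValueError(f"unsupported expression: {expr!r}")


def score(document_json, value):
    """Run every expression in the action; the last result is the output."""
    document = json.loads(document_json)
    result = None
    for expr in document["action"]:
        result = evaluate(expr, value)
    return result


print(score(DOCUMENT, 3.0))  # (3.0 + 1.0) * 2.0 = 8.0
```

Because the scorer depends only on the document, the framework that produced the model never has to be present in the serving environment.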
A simple example
To illustrate the elements of a PFA document, I will use a simple multi-class logistic regression model. The following images show the input and output schemas, together with the action that is applied to the input.
You can see that the output (the predicted class from the model) is the result of a set of built-in mathematical functions applied to the input (a feature vector in the form of an array). These functions are similar to functions one might call in Numpy if using Python code, for example.
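For readers who prefer code, the same computation can be sketched in plain Python. This is only an analogy for what the PFA action expresses, not the document's actual contents: the helper names and the coefficient values below are made up, and the standard library is used in place of Numpy so the snippet runs anywhere.

```python
import math


def linear(features, coefficients, intercepts):
    """One margin per class: dot(coefficient row, features) + intercept."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(coefficients, intercepts)
    ]


def softmax(margins):
    """Convert margins to class probabilities (shifted for numerical stability)."""
    top = max(margins)
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=values.__getitem__)


def predict(features, coefficients, intercepts):
    return argmax(softmax(linear(features, coefficients, intercepts)))


# Made-up parameters for 3 classes and 2 features.
COEFFICIENTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
INTERCEPTS = [0.0, 0.0, 0.0]

print(predict([2.0, 0.5], COEFFICIENTS, INTERCEPTS))  # margins [2.0, 0.5, -2.5] -> class 0
```

In a PFA document, the coefficients and intercepts would be stored as part of the model's state, and each of these steps would be a call to a built-in function rather than hand-written code.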
Introducing Aardpfark: A library for exporting Spark ML models to PFA
Today, the CODAIT team is excited to announce the release of Aardpfark, an open source library for exporting Spark ML models and pipelines to PFA. While reference implementations for authoring PFA documents exist for Python and R, there were none for Scala. The aardpfark-core module provides a generic Scala DSL for authoring PFA documents. The aardpfark-spark module uses the core DSL to create exporters for Spark ML components, as well as pipelines consisting of supported components.
To give an idea of relative performance, scoring a typical pipeline consisting of 47 string indexer transformers and 27 raw numerical columns, with a linear regression model, takes 1.9 seconds per record using Spark ML. The same pipeline scored with PFA executes in 1 millisecond per record, which is roughly 1,900 times faster, or about three orders of magnitude!
Aardpfark is still in the early phases of development, though it already supports many Spark ML predictors and transformers. In the near future, we plan to complete coverage of Spark ML components, run more detailed performance testing, round out the Scala DSL, add Python support, fix outstanding issues, and work on publishing an initial release. We would love to hear your feedback and ideas for further developing the project. We also welcome contributions from the community to help deliver the roadmap outlined above! For more information, check out Aardpfark on GitHub. CODAIT’s Nick Pentreath will also be giving a talk on Spark ML and PFA at Spark Summit today, June 5th.