Exporting Spark ML Models to the Portable Format for Analytics
The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering, and model selection, while important, is only one part of the picture. What is often missing is the deployment and management of models, along with the integration between the model creation and deployment phases.
The Portable Format for Analytics (PFA) is an emerging open standard for exporting and executing analytic applications, in particular machine learning models and pipelines. A key benefit of PFA is that the format specifies both the serialization and the execution of the model. This makes models exported to PFA independent of the model-producing application and truly portable across languages, frameworks, and runtimes.
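To make the "serialization plus execution" idea concrete, here is a minimal sketch in Python. A PFA document is JSON that declares an input type, an output type, and an action made of expressions. The document below adds 100 to its input. The evaluate and score helpers are hypothetical and written only for this illustration; they handle just the single "+" expression used here and are not a real PFA engine.

```python
import json

# A minimal PFA document: Avro-style type names for input and output,
# plus an action expressed as a list of PFA expressions.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}


def evaluate(expr, input_value):
    """Toy evaluator for the tiny subset used above (illustration only)."""
    if expr == "input":
        # A bare "input" string refers to the current input datum.
        return input_value
    if isinstance(expr, (int, float)):
        # Numeric literals evaluate to themselves.
        return expr
    if isinstance(expr, dict) and list(expr) == ["+"]:
        # {"+": [a, b]} is a call to the addition function.
        left, right = expr["+"]
        return evaluate(left, input_value) + evaluate(right, input_value)
    raise ValueError(f"unsupported expression: {expr!r}")


def score(document, input_value):
    """Run every expression in the action; the last one is the output."""
    result = None
    for expr in document["action"]:
        result = evaluate(expr, input_value)
    return result


# The model travels as plain JSON text, independent of whatever produced it.
serialized = json.dumps(pfa_document)
restored = json.loads(serialized)
print(score(restored, 1.5))
```

Because the document itself states what to compute, any runtime that understands the format can score it without access to the application that produced it.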
Aardpfark is a Scala library and domain-specific language (DSL) for easily creating PFA models. It is initially aimed at exporting Apache Spark ML models and pipelines to PFA.
Deploying Spark ML pipelines for low-latency scoring is particularly challenging at present. While Spark’s MLlib machine learning library is powerful, elegant, and works well in batch scoring scenarios, it is relatively ill-suited to the needs of many real-time predictive applications, because its dependence on the Spark SQL runtime introduces unacceptable latencies.
By exporting Spark ML models and pipelines to PFA, Aardpfark dramatically simplifies deployment and enables low-latency scoring for real-time applications.
Why should I contribute?
The Aardpfark project currently covers many Spark ML predictors and transformers, as well as pipelines consisting of supported components. We aim to add coverage for almost all Spark ML components in the future, as well as Python support. We also aim to complete the Scala DSL for PFA creation and push out a first release. Further down the road, we intend to expand Aardpfark to cover other popular machine learning libraries, starting with scikit-learn. If you want to simplify deployment of machine learning models using open standards, we welcome contributions to the project in all these areas.