Apache SystemML

NOTE: SystemML has graduated from developerWorks Open and is now an Apache Incubator project. For current information, see the Apache SystemML website.

Machine learning (ML) is the capability of computers to learn without being explicitly programmed. Although the broad ideas around ML are well formulated, the field continues to rapidly gain interest due to the vast proliferation of digital data and the ready availability of compute power to digest it all. The age of thinking machines is upon us.

Apache SystemML advances machine learning in two very important ways. First, the Apache SystemML language, Declarative Machine Learning (DML), includes linear algebra primitives, statistical functions, and ML-specific constructs that make it easier and more natural to express ML algorithms. Algorithms can be expressed in either an R-like or a Python-like syntax. DML significantly increases the productivity of data scientists by providing full flexibility in expressing custom analytics as well as data independence from the underlying input formats and physical data representations.
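To give a rough sense of what it means to express an ML algorithm through linear algebra primitives, here is a minimal sketch of least-squares linear regression solved with the normal equations. It is written in plain Python rather than DML, and the helpers (transpose, matmul, solve, linear_regression) are hypothetical stand-ins written for this illustration, not part of Apache SystemML. In an R-like syntax the core of the same computation would read roughly as solve(t(X) %*% X, t(X) %*% y).

```python
# Illustrative only: plain-Python stand-ins for linear algebra primitives.
# None of these helpers come from Apache SystemML; they are assumptions
# made for this sketch. Matrices are lists of rows.

def transpose(A):
    """Return the transpose of matrix A."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Return the matrix product A * B."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b for a square A and an n x 1 column b,
    using Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Build the augmented matrix [A | b].
    M = [list(A[i]) + [b[i][0]] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [[M[i][n] / M[i][i]] for i in range(n)]

def linear_regression(X, y):
    """Least-squares coefficients via the normal equations:
    (X^T X) beta = X^T y."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, y))

# Tiny example: the points lie exactly on y = 1 + 2x.
# The first column of X is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [[1.0], [3.0], [5.0], [7.0]]
beta = linear_regression(X, y)
print(beta)  # intercept and slope, approximately 1 and 2
```

The point of the sketch is that the algorithm itself is the single line inside linear_regression; everything else is plumbing that a language with built-in matrix operations would supply.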

Second, Apache SystemML provides automatic optimization according to data and cluster characteristics to ensure both efficiency and scalability. Apache SystemML runs in MapReduce or Spark environments.

Why should I contribute?

You’ll learn about Apache Spark and the DML scripting language, but probably the most important takeaway will be how to implement an advanced ML system in a parallel, distributed environment.

What technology problem will I help solve?

Apache SystemML will benefit from contributions in several areas. Data scientists can contribute new algorithms or enhance existing ones by making them more robust and accurate. Engineers can build support for other distributed platforms and help with the parser or improve the performance of the runtime.

How will Apache SystemML help my business?

Apache SystemML promises to greatly improve the productivity of analysts and data scientists by providing 1) DML, a declarative, R-like language for flexibly expressing custom analytics and 2) data independence from the underlying input formats and physical data representations.