IBM has a clear strategy towards Open Tech AI. I’m sure you remember back in 2011 when IBM Watson defeated two world champions at Jeopardy. Apart from being backed by the awesome number of 2,880 IBM POWER7 cores providing 11,520 hardware threads and 16 TB of main memory, IBM Watson was running on Linux and using Hadoop. A lot has happened since then, and we are now at a stage where IBM is one of the world leaders in AI technology and services.
But what I’m especially proud of is IBM’s clear strategy to create, support, and enhance Open Tech AI.
Open Tech AI: End-to-end Enterprise AI
Back in 2015, IBM opened the Apache Spark Technology Center in San Francisco. The center has since been renamed CODAIT, which stands for the Center for Open-Source Data and AI Technologies.
CODAIT continues to work on the Apache Spark project, where IBM has contributed more than 50,000 lines of code and remains the main contributor to Apache Spark Machine Learning (specifically ML Pipelines, the new DataFrame-based machine learning API). Its mission, however, has grown beyond Spark.
Dr. Angel Diaz states in his blog: “Today, I am happy to announce Spark Technology Center’s expanded mission, which now encompasses the end-to-end Enterprise AI lifecycle.”
Let me walk you through a subset of important projects that CODAIT is contributing to:
- Apache Spark is the de facto standard when it comes to open source parallel data processing. Apache Spark still holds multiple world records in data processing performance. You can learn more about Apache Spark in my book, Mastering Apache Spark 2.x.
- TensorFlow is the most widely used DeepLearning framework on the planet. Everybody loves it. I hate it. I really don’t understand why people love such a low-level framework, which requires you to define everything on a linear algebra level. But, nevertheless, TensorFlow has some important features for AI researchers. The most prominent feature is automatic differentiation, which computes the gradients of your model’s optimization objective so that you don’t have to derive them by hand. To learn more about TensorFlow, watch my introduction to TensorFlow in this video that is part of our Coursera course, Applied AI with Deep Learning, which is part of the Coursera Advanced Data Science Specialization.
- Keras is exactly what a DeepLearning framework should be. (As much as I hate TensorFlow, I love Keras.) Keras is easy to understand and use. All you need to do is stack neural network layers, and under the hood it uses TensorFlow for execution. You can also export Keras models and run them in SystemML (see below) and DeepLearning4J, both of which support Apache Spark as a runtime. DeepLearning4J has the same easy-to-use high-level API as Keras but supports Java and Scala natively. You can learn more about Keras models and the exporter from watching my interview with Max Pumperla, the guy who developed the model exporter.
- SystemML is definitely one of the most underestimated DeepLearning frameworks in existence. SystemML is for linear algebra what SQL is for relational databases. You write your algorithm in a Domain Specific Language (in either R or Python syntax), and a cost-based optimizer turns it into an efficient execution plan. SystemML can use Apache Spark as a runtime. IBM committed 65,000 lines of code to SystemML. You can learn more about SystemML from watching my talk on SystemML at the Swiss Data Science Conference in 2016, or from watching my interview with Berthold Reinwald, the man behind Apache SystemML.
- Apache Arrow is a high-performance library for a fast, column-based in-memory data layout. Maybe you are using Apache Arrow in your daily life without knowing it. It is used when you convert between Apache Spark and Python pandas DataFrames. At least, that’s how I’m using it, and I’m very happy with its performance.
- Apache Bahir is a set of connectors for Apache Spark and Apache Flink for persistent and in-flight data sources. I’m maintaining a fork of Apache Bahir (https://github.com/romeokienzler/bahir), supporting access to IBM Cloudant, the managed Apache CouchDB in IBM Cloud, where the number of requests per second is limited.
- Apache Toree is a Jupyter kernel optimizing access to Apache Spark. The Jupyter open source project is so widely used that it became the de facto standard for developing data science scripts. Without Toree, scripting Apache Spark jobs on Jupyter would never have been so successful and widely adopted. As with Arrow, you never directly interact with Toree, but you continuously benefit from the fact that it is there.
- Apache Zeppelin is the runner-up when it comes to data science scripting using notebooks. The UI is much nicer than Jupyter’s, but I still use only Jupyter for my day-to-day job.
- Apache Livy enables you to interact with an Apache Spark cluster via a REST API without installing an Apache Spark client. This is essential for developing interactive user interfaces on top of data processing and analytics jobs (so-called “data products”) backed by Apache Spark. Otherwise, you would need to wrap command-line executions of Apache Spark jobs within your code, including all the tedious life cycle management. Livy solves this problem for you. Really great stuff.
- Fabric for Deep Learning (FfDL) completely takes away the burden of managing a data-parallel cluster infrastructure, such as Apache Spark, on top of virtual machines. Just throw your data into a bucket on ObjectStore, define your model in a Jupyter Notebook or visually using the Neural Network Modeler of Watson Studio, and train and scale it by using Watson Machine Learning. I’ve been working with it since the beta version of IBM Watson Machine Learning, and I must say it’s one of the most exciting Watson services for AI and ML engineers. Currently, the following frameworks are supported, and that’s all I really need: TensorFlow, Keras, Caffe and Caffe2, PyTorch, Spark MLlib, scikit-learn, XGBoost, and SPSS. But here comes the bang! IBM open-sourced the complete runtime of IBM Watson Machine Learning in this FfDL package. So, you can use the very same APIs in the IBM Cloud as on any other cloud provider or in an on-prem data center. And you can contribute to it to make it better! All you need is a Kubernetes cluster with some CPUs or GPUs and an NFS server.
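
To make the Apache Livy entry above a bit more concrete, here is a minimal sketch of what "talking to Spark over REST" can look like from plain Python, with no Spark client installed. The `/sessions` and `/sessions/{id}/statements` endpoints are part of Livy's REST API. Everything else is an assumption for illustration: the server address `http://localhost:8998` (8998 is Livy's default port) and the helper names `livy_request`, `create_session`, `run_statement`, `get_statement`, and `send` are mine.

```python
import json
from urllib.request import Request, urlopen

# Assumption: a Livy server is reachable here (8998 is Livy's default port).
LIVY_URL = "http://localhost:8998"


def livy_request(path, payload=None, base_url=LIVY_URL):
    """Build a JSON request for a Livy endpoint: POST if a payload is given, else GET."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET" if payload is None else "POST",
    )


def create_session(kind="pyspark"):
    """Request that starts an interactive Spark session of the given kind."""
    return livy_request("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Request that submits a code snippet to an existing session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


def get_statement(session_id, statement_id):
    """Request that polls the state and output of a submitted statement."""
    return livy_request(f"/sessions/{session_id}/statements/{statement_id}")


def send(request):
    """Send a request and decode the JSON reply. Needs a running Livy server."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a real server you would call `send(create_session())`, wait until the session reports the `idle` state, submit code with `send(run_statement(session_id, "..."))`, and then poll `get_statement` until the result is available. The request builders are kept separate from `send` so that they can be inspected without a cluster.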
IBM Watson Studio and the Open Tech AI
IBM Watson Studio has also enhanced its AI capabilities, with its Neural Network Modeler that is part of IBM’s DeepLearning as a Service offering.
Designing a DeepLearning Neural Network quickly becomes confusing. Using the Neural Network Modeler in IBM Watson Studio, you can draw your neural network using a graphical user interface. And the cool thing is that it comes for free. It also reads and writes TensorFlow, Keras, Caffe, and PyTorch models, and very soon it will read ONNX models. With the Neural Network Modeler, you can graphically design neural networks on top of nearly every state-of-the-art open source DeepLearning library.
DeepLearning as a Service in IBM Watson Studio not only allows you to create DeepLearning Neural Networks without coding using a graphical editor, it also takes care of hyper-parameter tuning using the Experiment Assistant. So, different models are created automatically, and their performance is evaluated. Things you normally would do using GridSearch and TensorBoard can now be done with just a click of your mouse.
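For readers who have never done this by hand, the following is a rough sketch of the manual grid search that such tooling saves you from writing. It is not how the Experiment Assistant works internally. The helper `grid_search`, the scoring function `toy_score`, and the parameter names `learning_rate` and `hidden_units` are all made up for illustration; in practice the scoring function would train a model and return its validation accuracy.

```python
from itertools import product


def grid_search(train_and_score, grid):
    """Try every combination in `grid`; return (best_params, best_score, all_results)."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, train_and_score(**params)))
    # Pick the combination with the highest score.
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


# Hypothetical stand-in for "train a model and return its validation accuracy".
def toy_score(learning_rate, hidden_units):
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(hidden_units - 64) / 1000


grid = {"learning_rate": [0.001, 0.01, 0.1], "hidden_units": [32, 64, 128]}
best_params, best_score, results = grid_search(toy_score, grid)
print(best_params, best_score)
```

Even this tiny grid means nine training runs, and every extra hyper-parameter multiplies that number, which is why having the runs created and compared automatically is so convenient.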
Learning more about Open Tech AI
I’m so excited to be a part of IBM’s Open Tech AI strategy that we started a conference series called Open Tech AI Summit (OTAIS), which took place in Helsinki this March. The next one will take place this month, on May 28th, in Zurich, Switzerland. If you can make it, I’d be delighted to meet you face to face.
Last, but not least, if you are interested in the topic, please consider taking our Coursera course on Applied AI with DeepLearning, which is part of a Coursera specialization called “Advanced Data Science”. I think this set of courses is particularly great if you are interested in Open Tech AI. The DeepLearning course teaches you everything you need to know about DeepLearning and how to use it in Keras, TensorFlow, and PyTorch. It shows you how to apply DeepLearning to real-world problems and explains how models can be scaled for training and inference.