With a bit of bad luck, you as a data scientist spend 80 percent of your time searching for the right data: a downright waste of time, money and energy. The open-source-based, fully integrated Watson Studio gives you exactly the tools you need.
As a data scientist, you play an increasingly important role in addressing business issues. But without quick access to relevant data and adequate tools, you cannot get anywhere. Watson Studio helps you, as a professional, to move quickly from the business question to a solution.
Suppose your organization wants insight into the churn risk of individual customers. This is no accidental example: churn prediction is the ‘hello world’ of machine learning, one of the first exercises you tackle in this area. It is well-known territory and, moreover, nicely binary: the customer leaves, or not. The first thing you do as a data scientist in Watson Studio is create a working environment for the project, in which all data and machine learning assets are brought together, along with all the people involved: the so-called ‘collaborators’.
Next, you find the right data with the help of the available tools. You look in the so-called Knowledge Catalog to see which data may be relevant for the specific question. When you zoom in on a data source from the graphical user interface, you get all sorts of information about that data in a ‘review’. The profile shows what the data looks like, what it contains, and whether values are missing. Comments can also be added to a specific data sample. Once you have decided that the data is valuable for the assignment, it can be added as a project asset: either the data file itself or a connection to the data.
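The kind of profile the catalog shows can be approximated in plain pandas, one of the open source libraries the platform builds on. This is an illustrative sketch only: the column names and the tiny sample below are made up, not taken from Watson Studio.

```python
import pandas as pd

# Hypothetical customer sample, standing in for a data asset from the catalog
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "tenure_months": [12, 3, None, 24],
    "churned": ["no", "yes", "no", None],
})

# A quick "profile": shape, column types, and missing values per column
print(df.shape)         # (4, 3)
print(df.dtypes)
print(df.isna().sum())  # one missing tenure, one missing churn label
```

A real catalog profile adds value distributions and data-class detection on top of this, but the idea is the same: inspect before you commit the asset to the project.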
Note that the Knowledge Catalog does not contain the data itself, but only enough information for you to decide whether the asset is valuable for the project or not. Knowledge Catalog has connectors to both cloud and on-premises resources (both IBM and non-IBM).
Often this data is not completely ready to be fed to the machine learning algorithms. In that case the data must first be ‘refined’: cleaned, de-duplicated, and stripped of irrelevant fields. In this step the data can also be enriched with additional information or combined with other sources. The Data Refinery tool offers a GUI-driven approach to this task. It is based on the popular R library dplyr (actually sparklyr, since the entire Watson Studio platform is built on Apache Spark), so data can be manipulated and prepared step by step. Everything integrated and based on open source.
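The same refinement steps (de-duplication, dropping irrelevant fields, enriching from a second source) can be sketched as a step-by-step chain. The sketch below uses pandas rather than dplyr/sparklyr, and all column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw extract: contains a duplicate row and an irrelevant field
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "tenure_months": [12, 12, 3, 24],
    "internal_note": ["a", "a", "b", "c"],  # not useful for modeling
})

# Hypothetical second source used to enrich the data
usage = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "monthly_calls": [40, 5, 22],
})

refined = (
    raw.drop_duplicates()                 # de-duplication
       .drop(columns=["internal_note"])   # remove irrelevant fields
       .merge(usage, on="customer_id")    # enrich by combining sources
)
print(refined)
```

In Data Refinery each of these operations is a recorded step in the GUI, which makes the flow repeatable on new data.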
Ready to feed
The next phase is the ‘run’, in which the data preparation steps are actually executed. This produces a new data asset, ready to feed to the machine learning algorithms. A model can then be created that predicts churn; in this case we chose a binary classification. The performance of multiple mathematical models, for example a ‘decision tree’ or a ‘random forest’, can be compared.
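Watson Studio does this comparison through the GUI; under the hood it is the kind of exercise you could also run yourself with scikit-learn. A minimal sketch, using a synthetic dataset as a stand-in for the prepared churn data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (the customer leaves, or not)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy as a simple performance comparison
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The best-scoring model is the natural candidate for the deployment step that follows.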
With a few clicks the best model can be put into production as an API, for use in batch or real-time predictions.
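What such a deployed model boils down to can be sketched as a serialized model plus a scoring function. The JSON payload shape below is purely illustrative, not the actual Watson Machine Learning API contract:

```python
import json
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train and serialize a model: a stand-in for "putting it into production"
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
blob = pickle.dumps(model)

def score(payload: str) -> str:
    """Hypothetical scoring endpoint body: JSON rows in, predictions out."""
    rows = json.loads(payload)["values"]
    clf = pickle.loads(blob)
    return json.dumps({"predictions": clf.predict(rows).tolist()})

print(score(json.dumps({"values": X[:2].tolist()})))
```

A real deployment wraps this in an HTTP service with authentication and versioning, but the request/response cycle is the same idea.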
Watson Studio: Based on open source tools
The above illustrates how, with the right tools, you can get from a business question to a well-functioning churn model in about half an hour. Obviously, this assumes an ideal world in which the right data is available in the right format, but the overview certainly shows that Watson Studio is a valuable tool for a fast journey from business question to a model in production.
Do note that Watson Studio is built entirely on open source tools such as Jupyter Notebooks and RStudio and open source software such as Apache Spark, Python (Anaconda) and R. What the platform adds are visual coding aids such as Data Refinery (dplyr) and the Neural Network Modeler (TensorFlow, Keras, Caffe, PyTorch), collaboration tools such as the Knowledge Catalog and the Community, and, above all, support for a fast journey from business problem to ML/AI solution.
Would you like to try it yourself? Follow this code pattern “Analyze bank marketing data using XGBoost to gain insights into client purchases” to learn more: