Machine learning with IBM Watson AutoAI: Data exploration and visualization

About this video

In this first part of a three-part video series on experimenting, automating, and deploying a machine learning model using IBM Watson AutoAI, learn about data exploration and visualization.

When you’ve watched this video, continue with Part 2, which shows how to create the cloud services and run the AutoAI experiment. Part 3 explains how to connect the model API to a web app. The demo video ties it all together.

Transcript for this video

Hey everybody. How’s it going? Horea Porutiu here, a software engineer at IBM.

Today, I want to talk about one of the latest AI (artificial intelligence) solutions that I built around machine learning. I think you’ll find it a really easy, quick, and free way to apply artificial intelligence and build a useful project from a free data set from Kaggle.

So before we get started, I want to break down the project and tell you exactly how this video series is going to be divided. There are going to be four videos. The first one is a quick three-minute video of the end product: a Python web application talking to our deployed machine learning model in the cloud. That one is just a quick demo of the UI, showing how to fill in the form and then actually use the model on the back end.

Next is the video you’re watching now. This is the architecture demo: the project overview, the cloud services, and the actual architecture of the project. Then we’re going to explore the data set. We’ll look at the Kaggle data set and use Python visualization libraries to see if there are any correlations in the data before we run any machine learning experiments on it. The third video focuses on creating all the cloud services. You’ll create an IBM Watson Studio project, a Cloud Object Storage instance to store the data set, and a Machine Learning instance so that you can run all these experiments, create pipelines, and eventually deploy a machine learning model.

The last video, the fourth one, will be about implementation: taking the best-performing model, deploying it as a web service, and connecting our Python Flask application to that deployed model in the cloud. So, we’re going to work with our API keys for Machine Learning, our instance IDs, and all the credentials we need to connect to our deployed model in the cloud. I’ll link the whole video series in the description, so if you’re only interested in one of them, go ahead and switch to that video. If you want to keep learning about the data set and the project itself, keep watching this one. Thanks again for watching, and I hope you enjoy it.
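To make that connection step a bit more concrete, here is a minimal sketch of the kind of scoring payload a Flask app could send to a deployed Watson Machine Learning endpoint. The field names mirror the insurance data set; the row of values, the scoring URL, and the token in the comments are placeholders, not values from the actual project.

```python
# Hypothetical scoring payload for a deployed Watson Machine Learning model.
# Field names mirror the insurance data set; the values row is made up.
payload = {
    "input_data": [{
        "fields": ["age", "sex", "bmi", "children", "smoker", "region"],
        "values": [[31, "female", 25.7, 1, "no", "southeast"]],
    }]
}

# With real credentials, the app would POST this payload, for example:
# import requests
# response = requests.post(scoring_url, json=payload,
#                          headers={"Authorization": f"Bearer {token}"})
# prediction = response.json()
print(payload["input_data"][0]["fields"])
```

The exact endpoint URL and credential handling are covered in the last video of the series.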

Let me show you the data set first. We can go to Kaggle, which is a great way to find data sets you may be able to use in your projects, including the data set we’ll use today. Of course, you could use any other one as well, but we will use this insurance file. You can see it’s a pretty small file: only 1,300 rows and seven columns. What we’re trying to predict is expenses, that is, exactly how expensive our insurance costs are going to be. The features we have are age, sex, BMI, number of children, whether or not you’re a smoker, and the region. We want to see if there’s any bias across gender, and also whether different regions actually affect your insurance charges as well. That’s an overview of the data set, so let me show you where to find this project and what the overall project looks like. We do have a GitHub page. It’s ibm/predict-insurance-charges. Note that machine learning is a subset of AI, and in this particular example we are going to be using machine learning algorithms.
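As a rough sketch of what that seven-column layout looks like in code, here is a tiny pandas DataFrame in the same shape as the Kaggle file. The rows below are made up for illustration; the real file has about 1,300 of them.

```python
import pandas as pd

# Made-up rows in the shape of the Kaggle insurance data set:
# six feature columns plus the "expenses" column we want to predict.
data = pd.DataFrame({
    "age":      [19, 45, 31],
    "sex":      ["female", "male", "female"],
    "bmi":      [27.9, 22.7, 25.7],
    "children": [0, 2, 1],
    "smoker":   ["yes", "no", "no"],
    "region":   ["southwest", "northeast", "southeast"],
    "expenses": [16884.92, 8240.59, 3756.62],
})

features = data.drop(columns=["expenses"])  # what the model sees
target = data["expenses"]                   # what the model predicts
print(list(features.columns))
```

In the actual notebook, the frame comes from reading the CSV out of Cloud Object Storage rather than being typed in by hand.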

Here’s the main outline of the project. This is the end web app: a Flask application. You can see we predict the insurance charge. That may not mean much to you right now, but if you look at the data set it’ll make more sense. This is what the architecture looks like. As the data scientist, we basically create a Watson Studio instance within IBM Cloud. We create an instance of Object Storage, which is where we keep our data set, and this is all completely free, no credit card required. Nothing. Users just have to sign up for an IBM Cloud account. The third step is to feed in this data file and then run an AutoAI experiment on it, which essentially means we’re going to tell AutoAI what we want to predict. AutoAI is going to run different algorithms and try to optimize for the metric we’re going for. Specifically for us, we’re going to try to minimize the root mean squared error; if you’re a data scientist, you’ll know exactly what that means. If you’re not, don’t worry about it. We’re just going to try to minimize the error of our prediction. Lastly, we’re going to use IBM Watson Machine Learning to run all these experiments, and then we’re going to take our machine learning model and put a UI over it so that once we enter different fields in our UI, we can talk to our machine learning model and predict a certain outcome, which in our case is the insurance charge.
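For anyone curious what “minimize the root mean squared error” means in practice, here is a small sketch. The charge values below are invented for illustration, not taken from the data set.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared
    difference between actual and predicted values. Lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented insurance charges vs. a hypothetical model's predictions.
actual    = [7000.0, 35000.0, 12000.0]
predicted = [7500.0, 33000.0, 12500.0]
print(rmse(actual, predicted))
```

AutoAI computes this metric for each candidate pipeline and ranks the pipelines by it, which is how the best-performing model is chosen.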

Now we’re going to talk about the first optional part of the project, which is exploring the data. You can also see the demo here and watch the video section. The first part is that you can open this notebook, whether in Watson Studio or just in GitHub; it’s a Jupyter Notebook. We’re doing some data exploration at the beginning, so you do need to put in your API key from IBM Cloud Object Storage and your endpoint URL. These are all free; you just click on the service credentials. The first thing we do is call the head method on our data frame so we get the first five rows. Here is the difference in charges between a smoker and a nonsmoker: the average is about 35,000 for a smoker versus about 7,000 for a nonsmoker. The box plot illustrates this very easily. Then there’s gender impact, so we can check whether there’s bias across gender, and we see that there isn’t too bad of a bias there. We also have region here as well, and you can run all of this within Watson Studio, so that’s the first optional step.
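The smoker comparison in the notebook boils down to a groupby. Here is a toy version with stand-in numbers chosen to echo the averages mentioned above; the real notebook computes them from the full data set.

```python
import pandas as pd

# Stand-in data chosen so the group averages echo the video's numbers.
df = pd.DataFrame({
    "smoker":  ["yes", "yes", "no", "no", "no"],
    "charges": [36000.0, 34000.0, 6000.0, 8000.0, 7000.0],
})

print(df.head())  # first five rows, like calling head() in the notebook

# Average charge per smoker group: roughly 35,000 vs. 7,000 here.
means = df.groupby("smoker")["charges"].mean()
print(means)

# The box plot from the video can be reproduced with:
# df.boxplot(column="charges", by="smoker")
```

The same pattern, grouping by `sex` or `region` instead of `smoker`, gives the gender and region comparisons from the notebook.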

The meat of the project comes in step three, where we create our IBM Cloud services.

So in this video, you’ve just learned about the project architecture and the cloud services we’re going to use: mainly IBM Cloud Object Storage, IBM Watson Studio, and IBM Watson Machine Learning. Then we learned about the data set. We’re using a free, open source insurance data set from Kaggle, and we’re going to be predicting insurance charges. Lastly, we’ve seen a Jupyter Notebook that uses visualization libraries to find correlations between the features and what we’re trying to predict, which is the charges. We’ve done a little bit of data exploration, which is nice so you have some idea of which features are important to predicting the outcome, which is again the insurance charges. In the next video, we’ll start looking at how to create all these cloud services and then run the AutoAI experiment, which will generate the machine learning models.

Thanks for watching, and see you in the next video.