Skill Level: Any Skill Level

Learn how to better understand your data by examining its features: what they look like, how they are distributed, and which ones are interesting.


You should have already established the business objectives for your predictive maintenance project. The next step is to actually look at the data you have. Real-world data is typically noisy, large, and drawn from many disparate sources. This step is about accessing and exploring your data, assessing its quality, and understanding it in more detail. This knowledge is essential for the next step in the process – data preparation.


We will be working in your project notebook in Watson Studio, using Python. Access to Watson Studio is included in your license for APM Predictive Maintenance Insights. If you do not yet have APM Predictive Maintenance Insights set up, you can follow along with a free trial of Watson Studio for now.


For demonstration purposes for the rest of this series, we are going to use an open data set for predictive maintenance: the Genesis data set. Go ahead and download this data to get ready for this exercise. The data set represents a portable pick-and-place demonstrator which uses an air tank to supply all the gripping and storage units.


We can state the business objective as: identify anomalies as they occur, to reduce unnecessary preventive maintenance activity. Obviously, we have greatly simplified the objectives and the problem so that you can quickly learn how to get started with predictive maintenance.


In this exercise, some of the questions we will want to answer include:

  • What are the types of features in the data set?
  • What kind of values does each feature have?
  • Which features are discrete, and which are continuous?
  • How are the feature values distributed?
  • Can we better understand the data through visualization?
  • Are there any outliers?
  • How similar are some data points to others?


One prerequisite is to import the correct libraries for this exercise. Enter and run the following code after you have set up your project (more details in step 1):


import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb


  1. Gathering your data

    The first step is to gather the relevant asset and maintenance data for your analysis. This data might include data from your Maximo database, any SCADA or operational data, or data that you have stored in the Watson IoT Platform. Once you have gathered this data, you can bring it into your project in Watson Studio. For information on how to do this, see the Watson Studio documentation.

    In our example, we have loaded our Genesis data into our notebook in Watson Studio. Watson Studio automatically loads the required Python library for data analysis, Pandas; creates a data frame for you; and generates a preview of the data.

    After this data is loaded, we will call the data frame:
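    The original code is not reproduced here; previewing a data frame is the standard pandas `head()` method. A minimal sketch, using a small stand-in frame in place of the generated Genesis data frame:

```python
import pandas as pd

# A tiny stand-in frame; in the notebook, df_data_1 is created for you
# automatically when you load the Genesis data into Watson Studio.
df_data_1 = pd.DataFrame({
    'MotorData.ActCurrent': [0.21, 0.24, 0.22],
    'Label': [0, 0, 0],
})

# Preview the first rows of the data frame
print(df_data_1.head())
```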

  2. Describing the data

    Next, let’s get the dimensions of the data set by entering the following code. 
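    The code itself was shown as a screenshot in the original; the standard pandas way to get the dimensions is the `shape` attribute. A sketch with a stand-in frame:

```python
import pandas as pd

# Stand-in for df_data_1, the frame Watson Studio generates for you
df_data_1 = pd.DataFrame({
    'MotorData.ActSpeed': [0.0, 1.5, 1.4],
    'Label': [0, 0, 1],
})

# (rows, columns); the full Genesis data set reports (16209, 20)
print(df_data_1.shape)
```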



    You should see that the data set has 16209 rows and 20 features. The size of this data set is relatively small, so we shouldn’t encounter any processing issues when it comes to data modelling.


    Next, let’s look at the data types that we have. We can see that all we are handling is numeric data.


    print(df_data_1.dtypes)



    Let’s look at a statistical summary of each feature in the data set.
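    The summary comes from pandas’ `describe()` method; a sketch with a stand-in frame in place of the Genesis data:

```python
import pandas as pd

# Stand-in for df_data_1, the frame Watson Studio generates
df_data_1 = pd.DataFrame({'MotorData.ActCurrent': [0.20, 0.25, 0.22, 0.30]})

# Count, mean, standard deviation, min, quartiles, and max per feature
print(df_data_1.describe())
```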






    For each feature, you can see the total count, the mean, the standard deviation, the min and max values, and the quartiles. Note the standard deviation in particular – if it is 0, that feature is not going to have predictive capabilities. Also, in this data set we can see that many of the features are Boolean values.



  3. Data quality

    The quality of the data is a concern for every predictive maintenance exercise. Let’s do a quick test for data quality, even though we can assume that it is good in this example as the data has already been scrubbed for us. In the following code, we are only looking for the number of missing values. Missing data is one type of data quality issue, and probably the most common, but there are more. For example, in asset data, we often find the same install dates for all assets in the system, which is often a data entry mistake. As you can see, we don’t have any missing data, which is great.
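    The missing-value check referred to above is the usual pandas `isnull().sum()` chain; a sketch with a stand-in frame (which, unlike the Genesis data, includes one missing value so the count is non-zero):

```python
import pandas as pd
import numpy as np

# Stand-in for df_data_1, with one missing value for illustration
df_data_1 = pd.DataFrame({
    'MotorData.ActSpeed': [0.0, 1.5, np.nan],
    'Label': [0, 1, 0],
})

# Number of missing values per feature; the Genesis set returns 0 everywhere
print(df_data_1.isnull().sum())
```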


  4. Exploring the data

    Next, we will want to explore the data in more detail. Typically you will want to document any hypotheses that you are beginning to form, any promising attributes, and whether the data has altered the business objectives in any way.


    First, we look at skew, which measures the asymmetry of the distribution of each feature, that is, how much a distribution has shifted to the left or right. Understanding this matters for data preparation because many machine learning algorithms assume a normal distribution. The skew results show a positive (right) or negative (left) skew; values closer to zero indicate less skew.


    skewdf = df_data_1.skew()
    print(skewdf)




    We can see there is significant skew in features like NVL_Recv_Storage.GL_LightBarrier and NVL_Send_Storage.ActivateStorage, as well as the target variable, Label.


    Let’s check out the distribution of the target variable, Label, to see the extent of any class imbalance. We’ll use a bar chart to illustrate it as well as return the numerical breakdown.


    df_data_1.Label.value_counts().plot(kind = 'bar', color = 'green')




    We can see that the quantities of labels 1 and 2 are so small that they do not appear in the bar chart. Let’s see what the actual numbers are:
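    The counts come from pandas’ `value_counts()` method; a sketch with a deliberately imbalanced stand-in Label column (the real Genesis counts are 16,159 / 39 / 11):

```python
import pandas as pd

# Stand-in Label column with a deliberate class imbalance
df_data_1 = pd.DataFrame({'Label': [0] * 8 + [1] * 2 + [2]})

# Number of examples per class, most frequent first
print(df_data_1.Label.value_counts())
```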






    We can see there are over sixteen thousand examples with no anomaly and only 50 examples of anomalies: 39 of anomaly type 1 and only 11 of anomaly type 2. This points to a significant class imbalance problem, which we will need to address during data preparation.


    Let’s examine all of the features in more detail. We will create box plots and histograms for each feature. These visualizations will help us better understand the range of values for each feature.


    df_data_1.plot(kind='box', subplots=True, figsize=(20,20), layout=(5,4), sharex=False, sharey=False)

    df_data_1.hist(figsize=(20,20), layout=(5,4), color = 'green')




    In each box plot, the box spans the interquartile range and the line inside it is the median; the whiskers extend to the most extreme values within 1.5 times the interquartile range, and points beyond them are drawn as outliers. From the histograms, we can see the distribution of the Boolean values in several of the features, as well as several distributions that follow a normal curve. MotorData.ActSpeed is interesting: looking at both the box plot and histogram, we see a normal curve, but also a number of outliers that skew the curve to the right.


    Finally, we’ll plot a correlation matrix. A correlation matrix illustrates correlation coefficients between sets of features. Each feature in the matrix is correlated with each of the other features in the matrix. This enables us to see which pairs have the highest and lowest correlation.


    # Compute the correlation matrix
    corr = df_data_1.corr()
    fig = plt.figure(figsize=(15, 15))

    # Mask the upper triangle, since the matrix is symmetric
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Generate a custom diverging colormap
    cmap = sb.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sb.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
               square=True, linewidths=.5, cbar_kws={"shrink": .5})






    As we can see from the matrix, there is negative correlation between features like NVL_Recv_Storage.GL_I_Slider_IN and NVL_Recv_Storage.GL_I_Slider_OUT, which makes intuitive sense – if the slider is in, then it can’t be out! We see some positive correlation between several features, for example MotorData.IsForce and MotorData.ActCurrent, but none high enough to be of interest. However, the high negative correlation might be an issue for us, as some algorithms do not handle strongly correlated features well. This phenomenon is known as multicollinearity, and it is something we will need to address during data preparation, the next stage in the process.
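    As a sketch of how such strongly correlated pairs can be flagged programmatically (the 0.9 threshold and the stand-in frame below are arbitrary choices for illustration, not part of the Genesis exercise):

```python
import pandas as pd

# Stand-in frame with one perfectly anti-correlated pair, mimicking
# the Slider_IN / Slider_OUT relationship in the Genesis data
df_data_1 = pd.DataFrame({
    'NVL_Recv_Storage.GL_I_Slider_IN':  [1, 0, 1, 0, 1],
    'NVL_Recv_Storage.GL_I_Slider_OUT': [0, 1, 0, 1, 0],
    'MotorData.ActCurrent':             [0.2, 0.3, 0.1, 0.4, 0.2],
})

corr = df_data_1.corr()
threshold = 0.9  # arbitrary cut-off for this sketch

# List every feature pair whose absolute correlation meets the threshold
pairs = [(a, b, corr.loc[a, b])
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) >= threshold]
print(pairs)
```

Pairs flagged this way are candidates for dropping or combining during data preparation.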

  5. Next steps…

    Check out the next article in the series to learn more about data preparation.
