Step-by-step

Gathering your data
The first step is to gather the relevant asset and maintenance data for your analysis. This might include data from your Maximo database, SCADA or other operational data, or data that you have stored in the Watson IoT Platform. Once you have gathered this data, you can bring it into your project in Watson Studio. For information on how to do this, see the documentation for Watson Studio here and here.
In our example, we have loaded our Genesis data into our notebook in Watson Studio. Watson Studio automatically loads Pandas, the Python library required for data analysis; creates a data frame for you; and generates a preview of the data.
After this data is loaded, we will call the data frame:
df_data_1
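The generated cell in Watson Studio includes project-specific credentials and object-storage calls, so it is not reproduced here. As a minimal sketch of what it produces, assuming a small in-memory CSV with hypothetical column names in place of the real Genesis file:

```python
import io
import pandas as pd

# A minimal, self-contained stand-in for the cell Watson Studio
# generates: read a CSV into a pandas DataFrame named df_data_1.
# The columns and values here are hypothetical, not the Genesis data.
csv_text = "MotorData.ActSpeed,Label\n1450.2,0\n1448.9,0\n1702.5,1\n"
df_data_1 = pd.read_csv(io.StringIO(csv_text))

print(df_data_1.head())   # Watson Studio shows a preview like this
```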

Describing the data
Next, let’s get the dimensions of the data set by entering the following code.
print(df_data_1.shape)
You should see that the data set has 16,209 rows and 20 features. The size of this data set is relatively small, so we shouldn’t encounter any processing issues when it comes to data modelling.

Next, let’s look at the data types that we have. We can see that all we are handling is numeric data.

print(df_data_1.dtypes)
Let’s look at a statistical summary of each feature in the data set. Underneath is a screenshot of a subset of the output.

print(df_data_1.describe())
For each feature, you can see the total count, the min and max values, the interquartile range, the mean, and the standard deviation. One thing to note here is the standard deviation: if it is 0, then that feature is not going to have predictive capabilities. Also, in this data set we can see that many of the features are Boolean values.
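That observation about zero standard deviation can be turned into a quick filter. A sketch on hypothetical data, not the Genesis set:

```python
import pandas as pd

# Sketch: drop any feature whose standard deviation is 0, since a
# constant column carries no predictive value. Hypothetical data.
df = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "always_on": [1, 1, 1, 1],   # std == 0: constant column
})

zero_var = df.columns[df.std() == 0]
df_reduced = df.drop(columns=zero_var)
print(list(zero_var))   # ['always_on']
```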

Data quality
The quality of the data is a concern for every predictive maintenance exercise. Let’s do a quick test for data quality, even though we can assume that it is good in this example as the data has already been scrubbed for us. In the following code, we are only looking for the number of missing values. Missing data is one type of data quality issue, and probably the most common, but there are more. For example, in asset data, we often find the same install dates for all assets in the system, which is often a data entry mistake. As you can see, we don’t have any missing data, which is great.
df_data_1.isnull().sum().sum()
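The install-date problem mentioned above can be checked in a similar one-liner: flag columns where a single value dominates, which often signals a data-entry shortcut rather than reality. A sketch with hypothetical asset records:

```python
import pandas as pd

# Sketch: the share of rows taken by the most common value.
# A share of 1.0 means every asset has the same install date,
# which is usually a data-entry artifact. Hypothetical records.
assets = pd.DataFrame({
    "asset_id": [101, 102, 103, 104],
    "install_date": ["2001-01-01"] * 4,   # identical for every asset
})

top_share = assets["install_date"].value_counts(normalize=True).iloc[0]
print(top_share)   # 1.0 -> every asset shares one install date
```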

Exploring the data
Next, we will want to explore the data in more detail. Typically you will want to document any hypotheses that you are beginning to form, any promising attributes, and whether the data has altered the business objectives in any way.
First, we look at skew, which measures the asymmetry of the distribution of each feature, that is, how much a distribution has shifted to the left or right. It is important to understand this for data preparation because many machine learning algorithms assume a normal distribution. The skew results show a positive (right) or negative (left) skew. Values closer to zero have less skew.
skewdf = df_data_1.skew()
print(skewdf)
We can see there is significant skew in features like NVL_Recv_Storage.GL_LightBarrier and NVL_Send_Storage.ActivateStorage, as well as the target variable, Label.
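One common way to reduce strong positive skew during data preparation is a log transform. A sketch with hypothetical values, not the Genesis features:

```python
import numpy as np
import pandas as pd

# Sketch: np.log1p compresses the long right tail of a positively
# skewed feature, pulling its skew back toward zero.
# The values below are hypothetical, not the Genesis data.
s = pd.Series([1, 1, 2, 2, 3, 3, 4, 50, 120, 400], dtype=float)

print(round(s.skew(), 2))            # strongly positive
print(round(np.log1p(s).skew(), 2))  # reduced after the transform
```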
Let’s check out the distribution of the target variable, Label, to see the extent of any class imbalance. We’ll use a bar chart to illustrate it as well as return the numerical breakdown.
df_data_1.Label.value_counts().plot(kind = 'bar', color = 'green')
We can see that the quantities of labels 1 and 2 are so small that they do not appear in the bar chart. Let’s see what the actual numbers are:
print(df_data_1.groupby('Label').size())
We can see there are over sixteen thousand examples of no anomaly, and only 50 examples of anomalies: 39 of anomaly type 1 and just 11 of anomaly type 2. This points to a significant class imbalance problem, which we will need to address during data preparation.
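One common remedy for this kind of imbalance is random oversampling of the minority classes. A sketch on synthetic rows that mimic the three-class Label, not the real data:

```python
import pandas as pd

# Sketch of simple random oversampling: sample each class up to the
# size of the majority class, with replacement. Synthetic rows only.
df = pd.DataFrame({"x": range(20), "Label": [0] * 17 + [1] * 2 + [2]})

majority = df["Label"].value_counts().max()
balanced = pd.concat(
    [grp.sample(majority, replace=True, random_state=0)
     for _, grp in df.groupby("Label")],
    ignore_index=True,
)
print(balanced["Label"].value_counts())   # every class now has 17 rows
```

This is only one option; during data preparation you might also consider undersampling the majority class or class-weighted algorithms.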
Let’s examine all the features in more detail. We will create box plots and histograms of each feature. These visualizations will help us better understand the range of values for each feature.
import matplotlib.pyplot as plt

df_data_1.plot(kind='box', subplots=True, figsize=(20,20), layout=(5,4), sharex=False, sharey=False)
plt.show()
df_data_1.hist(figsize=(20,20), layout=(5,4), color='green')
plt.show()
In each box plot, the box length is the interquartile range and the line in the box is the median; the whiskers outside the box extend to the minimum and maximum values within 1.5 times the interquartile range, and any points beyond them are outliers. From the histograms, we can see the distribution of the Boolean values in several of the features, as well as several distributions that follow a normal curve. MotorData.ActSpeed is interesting: looking at both the box plot and the histogram, we see a normal curve, but also a number of outliers that are skewing the curve to the right.
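Those outliers can also be counted numerically, using the same 1.5 × IQR rule that the box-plot whiskers use. A sketch with synthetic speeds, not the real MotorData.ActSpeed values:

```python
import pandas as pd

# Sketch: flag points beyond 1.5 * IQR, the same rule the box-plot
# whiskers apply. Synthetic speed readings, not the Genesis data.
speed = pd.Series([1450, 1451, 1449, 1452, 1450, 1448, 1700, 1720])

q1, q3 = speed.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = speed[(speed < q1 - 1.5 * iqr) | (speed > q3 + 1.5 * iqr)]
print(len(outliers))   # 2
```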
Finally, we’ll plot a correlation matrix. A correlation matrix illustrates correlation coefficients between sets of features. Each feature in the matrix is correlated with each of the other features in the matrix. This enables us to see which pairs have the highest and lowest correlation.
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

cov = df_data_1.corr()
fig = plt.figure(figsize=(15, 15))
mask = np.zeros_like(cov, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Generate a custom diverging colormap
cmap = sb.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sb.heatmap(cov, mask=mask, cmap=cmap, vmax=.3, center=0,
           square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
As we can see from the matrix, there is negative correlation between features like NVL_Recv_Storage.GL_I_Slider_IN and NVL_Recv_Storage.GL_I_Slider_OUT, which makes intuitive sense: if the slider is in, then it can’t be out! We see some positive correlation between several features, for example MotorData.IsForce and MotorData.ActCurrent, but none high enough to be of interest. However, the high negative correlation might be an issue for us, because some algorithms do not handle those features very well. This phenomenon is known as multicollinearity, and it is something we will need to address during data preparation, which is the next stage in the process.
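One way to act on this during data preparation is to list the feature pairs whose absolute correlation exceeds a threshold, so that one feature from each pair can be dropped. A sketch on synthetic columns that only mimic the slider-in/slider-out pattern:

```python
import numpy as np
import pandas as pd

# Sketch: find feature pairs with |correlation| above a threshold,
# scanning only the upper triangle so each pair appears once.
# Column names and values are synthetic, not the Genesis data.
rng = np.random.default_rng(0)
slider_in = rng.integers(0, 2, 100)
df = pd.DataFrame({
    "Slider_IN": slider_in,
    "Slider_OUT": 1 - slider_in,      # perfectly anti-correlated
    "ActCurrent": rng.normal(size=100),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(a, b) for a in upper.index for b in upper.columns
         if upper.loc[a, b] > 0.9]
print(pairs)   # [('Slider_IN', 'Slider_OUT')]
```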

Next steps…
Check out the next article in the series to learn more about data preparation.