Skill Level: Any Skill Level

Get your data right for modelling.


Hopefully, you’ll have spent adequate time in the business and data understanding phases of your predictive maintenance project so that you are as ready as possible for data preparation. In data preparation, you get the data right for modelling by performing some or all of the following tasks: merging data sets, creating a sample subset if the data set is too large, aggregating data, creating new features, handling data quality issues, and splitting the data set into training and test data.

For the remainder of the series, we will frame this as an anomaly detection problem and assume that the business requires us to identify anomalies as they occur. I know this is a little contrived, but it lets us demonstrate the steps involved.

First, let's import the libraries that we will need:

!conda install -c conda-forge imbalanced-learn 
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN
from collections import Counter
from sklearn.svm import LinearSVC


  1. Data formatting

    You might recall from the article on data understanding that the timestamp column is in an odd format. This is actually a UNIX timestamp, and to make it more readable, we’ll need to convert it to a regular date-time format.
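    The conversion could be sketched as follows with pandas. This is a minimal sketch under assumptions: `df_demo` is a hypothetical stand-in for the real data frame, and the epoch values are assumed to be in milliseconds (use `unit="s"` if yours are in seconds).

```python
import pandas as pd

# Hypothetical stand-in for the real data frame; values are made up.
df_demo = pd.DataFrame(
    {"Timestamp": [1500000000000, 1500000000050, 1500000000100]}
)

# Assumption: the UNIX timestamps are expressed in milliseconds.
df_demo["Timestamp"] = pd.to_datetime(df_demo["Timestamp"], unit="ms")

# The gap between consecutive rows shows the sampling interval.
interval = df_demo["Timestamp"].diff().dropna()
print(df_demo)
print(interval.unique())
```

    Looking at the difference between consecutive rows is a quick way to confirm the sampling interval once the column is readable.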



    We can now better understand the dates and times in this column:




    We can now see more clearly that this data set contains 16,220 observations taken every 50 ms.

  2. Drop correlated attributes

    Because we saw that NVL_Recv_Storage.GL_I_Slider_IN and NVL_Recv_Ind.GL_NonMetall are negatively correlated with other features, we can drop them from our data set. We can also drop the timestamp feature, as it is irrelevant to our business objective.


    df_data_2 = df_data_1.drop(['Timestamp','NVL_Recv_Storage.GL_I_Slider_IN', 'NVL_Recv_Ind.GL_NonMetall'], axis=1)
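    If you want to find candidate columns programmatically rather than by eye, one possible approach is to scan the correlation matrix for strongly correlated pairs. The sketch below uses made-up data; the column names and the 0.9 threshold are assumptions for illustration, not values from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is almost the exact negative of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_demo = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = df_demo.corr()
threshold = 0.9  # assumed cut-off; tune for your own data

# Keep only the upper triangle so each pair is considered once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Flag a column if it is strongly correlated (positively or negatively)
# with any column that precedes it.
to_drop = [col for col in upper.columns if (upper[col].abs() > threshold).any()]
print(to_drop)

df_reduced = df_demo.drop(columns=to_drop)
```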


  3. Create training and test data

    After we have trained and built our model, we need to check how well it performs on unseen data. So, we’ll create a validation data set by taking 20% of the original data and putting it to one side for now.


    array = df_data_2.values
    X = array[:,1:17]
    Y = array[:,0]
    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)

    Let’s do a sanity check on the spread of the target variable to confirm that both the training and validation data sets contain examples of each class.


    num_zeros = (Y_validation == 0).sum()
    num_ones = (Y_validation == 1).sum()
    num_twos = (Y_validation == 2).sum()
    num_trzeros = (Y_train == 0).sum()
    num_trones = (Y_train == 1).sum()
    num_trtwos = (Y_train == 2).sum()
    print("Number of examples of validation label 0:", num_zeros)
    print("Number of examples of validation label 1:", num_ones)
    print("Number of examples of validation label 2:", num_twos)
    print("Number of examples of training label 0:", num_trzeros)
    print("Number of examples of training label 1:", num_trones)
    print("Number of examples of training label 2:", num_trtwos)
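    The same check can be written more compactly with `Counter`. The sketch below also passes `stratify` to `train_test_split`, which is an optional variation (not used in the split above) that keeps the class proportions the same in both subsets. The data here is made up purely for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels and random features.
rng = np.random.default_rng(7)
Y_demo = np.array([0] * 900 + [1] * 70 + [2] * 30)
X_demo = rng.normal(size=(len(Y_demo), 4))

# stratify=Y_demo preserves the class ratios in both subsets.
X_tr, X_val, Y_tr, Y_val = train_test_split(
    X_demo, Y_demo, test_size=0.20, random_state=7, stratify=Y_demo
)

train_counts = Counter(Y_tr.tolist())
validation_counts = Counter(Y_val.tolist())
print("Training:  ", sorted(train_counts.items()))
print("Validation:", sorted(validation_counts.items()))
```

    Stratifying is worth considering when a class is rare, because a purely random split could leave very few of its examples in the validation set.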


  4. Normalize data

    The next step is to normalize the training data. The range of values in each column differs significantly, so we will rescale them into a range between 0 and 1. This helps the optimization in many machine learning algorithms. There are several normalization techniques, but for this exercise, we will just look at MinMaxScaler. It works particularly well if value distributions are not Gaussian (normal) or the standard deviation is very small.


    scaler = MinMaxScaler(feature_range=(0, 1))
    rescaledX_train = scaler.fit_transform(X_train)
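    When the validation data is eventually used, it should be rescaled with the scaler that was fitted on the training data, rather than with a newly fitted one. A minimal sketch with made-up numbers (the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and validation features.
X_train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_validation_demo = np.array([[2.5, 200.0], [4.0, 600.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))

# Learn each column's min and max from the training data only.
rescaled_train = scaler_demo.fit_transform(X_train_demo)

# Reuse those training statistics for the validation data.
rescaled_validation = scaler_demo.transform(X_validation_demo)

print(rescaled_train)
print(rescaled_validation)
```

    Validation values that fall outside the training range end up outside 0 to 1, which is expected behavior rather than an error.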



  5. Address class imbalance

    As we saw during the data understanding phase, there is a huge class imbalance. A technique that I like to use to address this is to generate synthetic examples of the minority classes to create an artificial balance. ADASYN (Adaptive Synthetic) is a popular algorithm for this. ADASYN uses a nearest neighbors algorithm to create this artificial data for training our model.

    X_resampled, Y_resampled = ADASYN().fit_resample(rescaledX_train, Y_train)


  6. Data reduction

    Data reduction techniques can be applied to obtain a reduced representation of your data that is much smaller in volume, yet closely maintains the integrity of the original data. Machine learning on the reduced data will then be more efficient yet produce more or less the same analytical results. There are many approaches to this, but for the purposes of this project we will look at recursive feature elimination (RFE), a feature selection algorithm that reduces dimensionality by detecting and removing irrelevant, weakly relevant, or redundant features. In RFE, we use an estimator algorithm (in this case LinearSVC), which is trained on the initial set of features to obtain the importance of each feature. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features is reached.



    model = LinearSVC(random_state=0, tol=1e-5)
    rfe = RFE(model, n_features_to_select=5)
    fit = rfe.fit(X_resampled, Y_resampled)
    print("Num Features: %d" % fit.n_features_)
    print("Feature Ranking: %s" % fit.ranking_)



    We have identified the top 5 features and ranked the others. We will consider starting with just those 5 during modelling.
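    To find out which columns were kept, and to reduce the data to just those columns, the fitted RFE object exposes `support_` and `transform`. The sketch below uses a synthetic data set from `make_classification`; the feature names and the `max_iter` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Hypothetical 16-feature, 3-class data set.
X_demo, Y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=5,
    n_redundant=2, n_classes=3, random_state=0,
)
names = np.array([f"feature_{i}" for i in range(X_demo.shape[1])])

estimator = LinearSVC(random_state=0, tol=1e-5, max_iter=10000)
rfe_demo = RFE(estimator, n_features_to_select=5)
rfe_demo.fit(X_demo, Y_demo)

# support_ is a boolean mask over the columns; rank 1 means selected.
selected = names[rfe_demo.support_]
print("Selected features:", list(selected))

# transform keeps only the selected columns.
X_reduced = rfe_demo.transform(X_demo)
print("Reduced shape:", X_reduced.shape)
```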

  7. Next steps

    During the modelling phase, which is next, we will assess how useful these data preparation exercises actually are in improving the accuracy of the models. Stay tuned!
