You might recall from the article on data understanding that the timestamp column is in an odd format. This is actually a UNIX timestamp, and to make it more readable, we’ll need to convert it to a regular datetime format.
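A minimal sketch of that conversion, assuming the timestamps are UNIX epoch values in milliseconds (the toy values below are stand-ins for df_data_1['Timestamp']):

```python
import pandas as pd

# Toy UNIX timestamps in milliseconds, 50 ms apart (stand-ins for the real column)
df = pd.DataFrame({'Timestamp': [1538584800000, 1538584800050, 1538584800100]})

# Convert to a regular datetime format
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')
print(df['Timestamp'])
```

If the raw values were in seconds rather than milliseconds, `unit='s'` would be the right setting instead.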
We can now better understand the dates and times in this column:
We can now see more clearly that this data set contains 16,220 observations taken every 50 ms.
Drop correlated attributes
Because NVL_Recv_Storage.GL_I_Slider_IN and NVL_Recv_Ind.GL_NonMetall are negatively correlated with the other features, we can drop them from our data set. We can also drop the timestamp feature, as it is irrelevant to our business objective.
df_data_2 = df_data_1.drop(['Timestamp','NVL_Recv_Storage.GL_I_Slider_IN', 'NVL_Recv_Ind.GL_NonMetall'], axis=1)
Create training and test data
After we have trained and built our model, we need to check how well it performs on unseen data. So we’ll create a validation data set by setting aside 20% of the original data for now.
from sklearn.model_selection import train_test_split

array = df_data_2.values
X = array[:, 1:17]
Y = array[:, 0]
validation_size = 0.20
seed = 7  # fixed random state for reproducibility
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
Let’s do a sanity check on the spread of the target variable. As we can see, both the training and validation data sets contain examples of each class.
num_zeros = (Y_validation == 0).sum()
num_ones = (Y_validation == 1).sum()
num_twos = (Y_validation == 2).sum()
num_trzeros = (Y_train == 0).sum()
num_trones = (Y_train == 1).sum()
num_trtwos = (Y_train == 2).sum()
print("Number of examples of validation label 0:", num_zeros)
print("Number of examples of validation label 1:", num_ones)
print("Number of examples of validation label 2:", num_twos)
print("Number of examples of training label 0:", num_trzeros)
print("Number of examples of training label 1:", num_trones)
print("Number of examples of training label 2:", num_trtwos)
The next step is to normalize the training data. The range of values in each column differs significantly, so we’ll rescale them into a range between 0 and 1. This helps the optimization of many machine learning algorithms. There are several normalization techniques, but for this exercise we will just use scikit-learn’s MinMaxScaler. It works particularly well when value distributions are not Gaussian (normal) or the standard deviation is very small.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
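To see what this rescaling actually does, here is a toy column (the values are made up) passed through the same scaler; each value is mapped via (x − min) / (max − min):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column with made-up values far outside [0, 1]
X_toy = np.array([[10.0], [20.0], [40.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)
print(scaled.ravel())  # min maps to 0, max to 1, the rest proportionally in between
```

Note that the minimum and maximum are learned from the training data via fit_transform; at prediction time the same fitted scaler should be applied to new data with transform only.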
Address class imbalance
As we saw during the data understanding phase, there is a huge class imbalance. A technique that I like to use to address this is to create synthetic examples that produce an artificial balance. ADASYN (Adaptive Synthetic) is a popular algorithm for this: it uses a nearest-neighbors algorithm to create the artificial data for training our model.
from imblearn.over_sampling import ADASYN

X_resampled, Y_resampled = ADASYN().fit_resample(rescaledX_train, Y_train)
Data reduction techniques can be applied to obtain a reduced representation of your data that is much smaller in volume yet closely maintains the integrity of the original data. Machine learning then becomes more efficient while producing more or less the same analytical results. There are many approaches to this, but for the purposes of this project we will look at a dimensionality reduction algorithm, recursive feature elimination (RFE), in which irrelevant, weakly relevant, or redundant features are detected and removed. In RFE, an estimator (in this case LinearSVC) is trained on the initial set of features and the importance of each feature is obtained. The least important features are then pruned from the current set, and the procedure is repeated recursively on the pruned set until the desired number of features is reached.
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

model = LinearSVC(random_state=0, tol=1e-5)
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X_resampled, Y_resampled)
print("Num Features: %d" % fit.n_features_)
print("Feature Ranking: %s" % fit.ranking_)
We have now identified the top five features and ranked the rest. We will consider starting with just those five during modelling.
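A sketch of that selection step on made-up data of the same shape (in the real pipeline you would index X_resampled with the fitted rfe object instead): the fitted RFE exposes a boolean mask, support_, marking the retained columns.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Made-up data standing in for the resampled training set: 16 features, 5 informative
X_demo, y_demo = make_classification(n_samples=200, n_features=16,
                                     n_informative=5, random_state=0)

rfe = RFE(LinearSVC(random_state=0, tol=1e-5, dual=False),
          n_features_to_select=5).fit(X_demo, y_demo)

# support_ is a boolean mask over the columns; keep only the top-five features
X_top5 = X_demo[:, rfe.support_]
print(X_top5.shape)  # (200, 5)
```

Equivalently, rfe.transform(X_demo) returns the same reduced matrix, which is convenient inside a scikit-learn Pipeline.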
During the modelling phase, which is next, we will assess how useful these data preparation exercises actually are in improving the accuracy of the models. Stay tuned!