Developing a test design
Before you build a model, you should think about how you will assess the model's suitability. Typically, you need to decide how you will measure how good the model is, and also think about the data you will be testing on.
As this is a supervised learning problem, we will use accuracy to guide our training, estimated with cross-validation during the training phase of our modelling exercise. On imbalanced data you would usually use the F1 score to assess goodness, but because we rebalanced the data set when preparing the data, this is unnecessary here. We should, however, take a look at F1 when we test on our validation data set.
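To see why accuracy alone can mislead on imbalanced data, here is a minimal sketch in plain Python. The 95/5 class split and the always-predict-the-majority "model" are made-up assumptions for illustration, not values from the data set used in this article.

```python
# Hypothetical labels: 95 healthy readings (0) and 5 failures (1).
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives, so the score is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


acc = accuracy(y_true, y_pred)
f1_value = f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1_value:.2f}")
```

The majority-class predictor looks strong on accuracy while never detecting a single failure, which is exactly the situation F1 exposes.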
Selecting the model
Determining the most appropriate model will typically be based on the data you have available, the modelling goals, and the requirements of the model itself and of the output.
The approach that I like to use is to create a test harness to cycle the data through different models to see which one fits best, at least at a high level. In this scenario, we have a classification problem, so I have put together some models that typically work well.
These are only a small number of the available models. There are dozens to try, and as you gain experience you can often intuit which models or model types might work best for the data you have. Remember, however, that any model selection should be rigorously tested.
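from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Candidate models to compare, as (short name, estimator) pairs
models = []
models.append(('NC', NearestCentroid()))
models.append(('KNN', KNeighborsClassifier()))
scoring = 'accuracy'
results = []
names = []
for name, model in models:
    # shuffle=True is required when random_state is set in recent scikit-learn versions
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X_resampled, Y_resampled, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    # Report the mean accuracy and its standard deviation across the 10 folds
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)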
Building the models
During this part of the process, it is important to understand the models you have built, the parameter settings for those models, and any performance or data issues that you encountered.
In order to track your progress with a variety of models, be sure to keep notes on the settings and data used for each model. This will help you to share the results with others and retrace your steps.
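One lightweight way to keep such notes is an append-only log with one record per model run. The sketch below is only an illustration: the `ModelRun` fields, the `model_runs.jsonl` file name, and the scores are all hypothetical placeholders, not results from this article.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ModelRun:
    """One entry in the experiment log (field names are illustrative)."""
    name: str
    params: dict
    dataset: str
    cv_mean: float
    cv_std: float
    notes: str = ""


def append_run(log_path, run):
    """Append one run to a JSON Lines file, one JSON object per line."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(run)) + "\n")


def load_runs(log_path):
    """Read every logged run back as a list of dicts."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Placeholder values for illustration only.
log_path = Path(tempfile.mkdtemp()) / "model_runs.jsonl"
append_run(log_path, ModelRun("KNN", {"n_neighbors": 5}, "resampled-v1", 0.9, 0.01))
append_run(log_path, ModelRun("NC", {}, "resampled-v1", 0.8, 0.02, notes="baseline"))

logged = load_runs(log_path)
best = max(logged, key=lambda r: r["cv_mean"])
print(best["name"], best["params"])
```

Because each line is a complete record, the file can be shared as-is or loaded into a data frame later to retrace which settings and data produced which score.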
As we can see, KNN performed best, so let's dive into that model in more detail. There are several parameters that can be adjusted for KNN, but for illustrative purposes we will focus on the value of k. We will perform a grid search, which tunes that parameter by building a model for each candidate value and finding the best-performing one. Note that this might take some time to run, so I would recommend chunking the code into three different cells in your notebook.
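from sklearn.model_selection import GridSearchCV
knn = KNeighborsClassifier()  # base estimator whose n_neighbors we tune
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy').fit(X_resampled, Y_resampled)
print(grid.best_params_, grid.best_score_)  # best k and its mean cross-validated accuracy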
Once you have determined the parameters that produce the most accurate results, be sure to take note of them. This can help you when you decide to automate or rebuild the model with new data. In this case, we can see that the optimal value for K is 1, which generated an accuracy of 0.9999484722007523!
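As a self-contained sketch of how the tuned parameters can be read off a fitted grid search, the example below runs a smaller search on synthetic data. The data from `make_classification`, the narrower range of k, and the `demo_` names are assumptions for illustration, while `best_params_`, `best_score_`, and `cv_results_` are attributes of scikit-learn's `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; swap in your own training set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

demo_grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

best_k = demo_grid.best_params_["n_neighbors"]          # winning value of k
best_score = demo_grid.best_score_                      # its mean cross-validated accuracy
mean_scores = demo_grid.cv_results_["mean_test_score"]  # one mean score per candidate k
print(f"best k={best_k}, mean CV accuracy={best_score:.3f}")
```

Recording `best_params_` alongside the score makes it straightforward to rebuild the same model later with `KNeighborsClassifier(**demo_grid.best_params_)`.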
It's also important, when you assess the model, to take note of key information such as:
- Meaningful conclusions
- Any new insights
- Model execution issues and processing time
- Any problems with data quality
- Any calculation inconsistencies
Assessing the model
Now that you have a model that achieves a good fit, let's take a closer look at it to determine whether it is accurate and effective enough to be deployed.
It's a good idea to be methodical and base your assessment on your test plan.
For our purposes, we will test our model on the unseen data that we created in our validation data set during data preparation.
Remember how we also rescaled the data during data preparation? Well, we need to apply the same scaling to the validation set:
# Reuse the scaler fitted on the training data; do not refit it on the validation set
rescaledX_validation = scaler.transform(X_validation)
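The difference between reusing a fitted scaler and refitting it can be seen on toy numbers. This is a minimal sketch that assumes a `StandardScaler` (the article does not say which scaler was used) and made-up single-feature arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the validation values sit on a different range from the training values.
X_train_demo = np.array([[0.0], [10.0], [20.0]])
X_val_demo = np.array([[10.0], [30.0]])

# Learn the mean and standard deviation from the training data only.
demo_scaler = StandardScaler().fit(X_train_demo)
# Apply those training statistics to the validation data.
val_same_scaling = demo_scaler.transform(X_val_demo)

# Refitting on the validation data uses different statistics instead.
val_refit = StandardScaler().fit_transform(X_val_demo)

print(val_same_scaling.ravel())
print(val_refit.ravel())
```

With the training statistics, a validation value of 10 maps to the training mean (zero after scaling). After refitting, the same value maps to -1, so the model would see inputs on a scale it was never trained on.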
Next, we will run the model on the unseen data. As this data set is very imbalanced, we will focus on the F1 score, which is a better guide than accuracy for imbalanced data. We will also generate a confusion matrix and a report on the classification outcomes.
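knn = KNeighborsClassifier(n_neighbors=1)
# Fit before predicting (assumes the same X_resampled, Y_resampled training data used above)
knn.fit(X_resampled, Y_resampled)
predictions = knn.predict(rescaledX_validation)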
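The F1 score, confusion matrix, and classification report can be produced with scikit-learn's metrics module. The sketch below uses short, made-up label lists in place of the real validation labels and predictions, so the numbers it prints are illustrative only. In the notebook you would pass your own validation labels (for example a `Y_validation` array, a name assumed here) together with `predictions`.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Hypothetical true labels and predictions (1 = failure, 0 = no failure).
y_val_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

f1_demo = f1_score(y_val_demo, pred_demo)          # F1 for the positive class (label 1)
cm_demo = confusion_matrix(y_val_demo, pred_demo)  # rows: actual class, columns: predicted class

print(f"F1: {f1_demo:.2f}")
print(cm_demo)
print(classification_report(y_val_demo, pred_demo))
```

Reading the confusion matrix row by row shows how many failures were missed and how many false alarms were raised, which the single F1 number summarises.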
The results are pretty impressive: the model achieved an F1 score of 98%.
At this stage in the process, if you think that the model meets your predictive maintenance objectives, you can move on to a deeper evaluation of the models and look to deploy. Be sure that you can answer the following questions before you do decide to move to the next stage:
- Can you understand the results of the model?
- Do the model results make logical sense, and are they free from glaring inconsistencies (for example, terrific results in training but awful results on unseen data)?
- Do the results meet your business objectives?
- Have you thoroughly evaluated the model accuracy?
- Have you looked at multiple models and compared the results?
- Are the results of your model deployable?