As part of my own learning, and continuing from Part 1, we will try to improve our neural network model using some of the well-known machine learning techniques described in the TensorFlow documentation.

In the previous article, we saw certain problems with our training. Here, we will address them and see if our results improve as we go.

Problems observed in the previous solution

Overfitting

A model is considered to overfit when it performs with great accuracy on the training data (the data used to train it) but performs rather poorly when evaluated against a test or otherwise unseen data set. This happens because the model has learned patterns that are specific to the training data and do not generalize.

Training accuracy that is noticeably higher than testing accuracy is a clear indicator of this phenomenon. Thankfully, there are some techniques available to address it.

Model size

First, look at the size of the model, meaning the number of layers and units. If the model is far bigger than the problem at hand requires, it is more likely to learn features and patterns that are not relevant to the problem and thus overfit the training data. A model that is too large will not generalize well, while one that is too small will underfit the data.

Taking the model from the previous article as a baseline, we will evaluate how reducing and increasing the size affects the model's performance. The following models were tried and compared:


baseline_model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(128, activation=tf.nn.sigmoid),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

bigger_model2 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(1024, activation=tf.nn.relu),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

bigger_model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

smaller_model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

To determine the ideal model size, we plot the validation loss against the number of epochs for each model.

  1. Comparison of the smaller, bigger, and baseline models.

     [Plot: size comparison]

  2. Comparison of the bigger, bigger2, and baseline models.

     [Plot: size comparison, continued]

In these plots, we see that the validation loss (sparse_categorical_crossentropy) is nearly the same for the bigger and bigger2 models, and better than for the smaller and baseline models. So we select these two models over our baseline model for further tuning.
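For reference, the comparison above can be reproduced along the following lines. This is only a minimal sketch, not the exact script used for the plots; it assumes the train_images, train_labels, test_images, and test_labels arrays from Part 1 are already loaded, and it uses matplotlib for plotting.

import matplotlib.pyplot as plt

def compile_and_fit(model, epochs=250):
    # Track sparse_categorical_crossentropy explicitly so it can be plotted
    # as the validation loss curve.
    model.compile(optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy', 'sparse_categorical_crossentropy'])
    return model.fit(train_images, train_labels, epochs=epochs,
        validation_data=(test_images, test_labels), verbose=0)

histories = [('baseline', compile_and_fit(baseline_model)),
    ('smaller', compile_and_fit(smaller_model1)),
    ('bigger', compile_and_fit(bigger_model1)),
    ('bigger2', compile_and_fit(bigger_model2))]

for name, history in histories:
    plt.plot(history.history['val_sparse_categorical_crossentropy'], label=name)
plt.xlabel('Epochs')
plt.ylabel('Validation loss (sparse_categorical_crossentropy)')
plt.legend()
plt.show()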

Number of epochs

The number of epochs plays an important role in avoiding overfitting and in overall model performance. In the comparison graphs plotted in the section above, we observe that the validation loss reaches a minimum and then, with further training, increases again, while the training loss keeps decreasing. This is exactly what overfitting means: the model learns patterns specific to the training set and does not generalize well, so it does better on training data than on validation data. We have to stop before the model overfits the data, so in the above case an epoch value of about 40 is ideal.
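Rather than hard-coding an epoch count, Keras can stop training automatically once the validation loss stops improving. The original runs did not use this, but a brief sketch of the EarlyStopping callback (again assuming the data arrays from Part 1) looks like this:

# Stop once the validation loss has not improved for 10 consecutive epochs,
# and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
    restore_best_weights=True)

baseline_model.fit(train_images, train_labels, epochs=400,
    validation_data=(test_images, test_labels),
    callbacks=[early_stop], verbose=2)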

L1 and L2 regularization

L1 regularization adds a penalty to the loss proportional to the absolute values of the weights, while L2 regularization adds a penalty proportional to their squared values; both push the network toward smaller weights and a simpler model. The plot below shows the effect of applying L2 regularization to our model.

[Plot: effect of L2 regularization]
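As a concrete illustration (a sketch, not necessarily the exact configuration plotted above), a weight penalty can be attached to individual Dense layers through the kernel_regularizer argument:

# Baseline model with an L2 weight penalty: 0.001 * sum of squared weights is
# added to the loss for each regularized layer. keras.regularizers.l1 can be
# used the same way for an L1 penalty.
l2_model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(128, activation=tf.nn.sigmoid,
        kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dense(16, activation=tf.nn.sigmoid,
        kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])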

Using Dropout

The Keras library provides a Dropout layer, based on the technique introduced in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (JMLR 2014). Consequences of adding dropout layers are increased training time and, if the dropout rate is too high, underfitting.

Models after applying the dropout layers:

bigger_model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

bigger_model2 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(1024, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

The following image shows the effect of applying dropout regularization.

[Plot: effect of dropout regularization]

During one run, the bigger model did not converge at all, even after 250 epochs. This is one of the side effects of applying dropout regularization.

[Plot: bigger model failing to converge with dropout]

Lack of training data

With only 26 or so training examples, we have done reasonably well. But for image processing there are several data-augmentation techniques that generate more data by applying distortions to the original images. For example, for every input image we can add an inverted-color copy to our data set. To achieve this, the load_image_dataset function from Part 1 is modified as follows (it is also possible to add a randomly rotated copy of each original image; see the sketch after the listing):

# Modified load_image_dataset from Part 1 (imports as in Part 1;
# jpeg_to_8_bit_greyscale is also defined there).
# If invert_image is True, an inverted-color version of each image is
# stored in the training set as well.
import glob
import os
import re

import numpy as np

def load_image_dataset(path_dir, maxsize, reshape_size, invert_image=False):
    images = []
    labels = []
    os.chdir(path_dir)
    for file in glob.glob("*.jpg"):
        img = jpeg_to_8_bit_greyscale(file, maxsize)
        inv_image = 255 - img  # Inverted-color version of the original image.

        if re.match('chihuahua.*', file):
            images.append(img.reshape(reshape_size))
            labels.append(0)
            if invert_image:
                images.append(inv_image.reshape(reshape_size))
                labels.append(0)
        elif re.match('muffin.*', file):
            images.append(img.reshape(reshape_size))
            labels.append(1)
            if invert_image:
                images.append(inv_image.reshape(reshape_size))
                labels.append(1)  # Keep labels in sync with the inverted image.
    return (np.asarray(images), np.asarray(labels))
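For the randomly rotated copies mentioned above, one possible approach is to use scipy.ndimage. This helper is not part of the original function and is only a sketch; the angle range and fill value are arbitrary choices.

from scipy import ndimage
import numpy as np

def random_rotation(img, max_angle=25):
    # Rotate by a random angle in [-max_angle, max_angle] degrees, keeping the
    # original shape; areas exposed by the rotation are filled with white (255).
    angle = np.random.uniform(-max_angle, max_angle)
    return ndimage.rotate(img, angle, reshape=False, mode='constant', cval=255)

Each rotated copy would then be appended to images together with its label, exactly as the inverted image is handled above.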

The effect on training of adding inverted-color and randomly rotated images, with dropout enabled, is shown below. The data set grew to three times its original size.

[Plot: sigmoid with dropout on the augmented data set]

The result indicates that this augmentation actually made the overfitting worse.

Note: For data augmentation, Keras provides a built-in utility, keras.preprocessing.image.ImageDataGenerator, which will not be covered here.

Another way to overcome the problem of minimal training data is to take a pretrained model and fine-tune it on our new training examples. This approach is called transfer learning. Since TensorFlow and Keras provide a good mechanism for saving and loading models, this can be achieved quite easily, but it is out of scope here.
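For completeness, saving and reloading a trained Keras model (the mechanism that makes such reuse straightforward) looks roughly like this. The file name and the new_train_images / new_train_labels arrays are only placeholders for illustration.

# Persist the trained model (architecture, weights, and optimizer state).
baseline_model.save('chihuahua_vs_muffin.h5')

# Later: reload it and continue training on additional examples.
restored_model = keras.models.load_model('chihuahua_vs_muffin.h5')
restored_model.fit(new_train_images, new_train_labels, epochs=20, verbose=2)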

Conclusion

On further testing with different models and activation functions, the best results were observed with sigmoid as the activation function and a dropout layer added to our baseline model. Similar performance was observed with the relu activation function, but with sigmoid the curve was smoother. Reducing the image size to 50×50 also improved training time without impacting the performance of the models.

Apart from the above, I also tested a VGG-style multilayer CNN model and several other CNN variations, but the results with them were quite poor.

The following image shows the plot of the results from all three models.

[Plot: comparison of results from the three models]

Baseline model used:

baseline_model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(128, activation=tf.nn.sigmoid),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])

baseline_model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy', 'sparse_categorical_crossentropy'])
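
The training output below comes from a run along these lines (a sketch; the array names follow Part 1):

baseline_model.fit(train_images, train_labels, epochs=400,
    validation_data=(test_images, test_labels), verbose=2)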

Output:

- 0s - loss: 0.0217 - acc: 1.0000 - sparse_categorical_crossentropy: 0.0217 - val_loss: 0.2712 - val_acc: 0.9286 - val_sparse_categorical_crossentropy: 0.2712
Epoch 119/400
 - 0s - loss: 0.0224 - acc: 1.0000 - sparse_categorical_crossentropy: 0.0224 - val_loss: 0.2690 - val_acc: 0.9286 - val_sparse_categorical_crossentropy: 0.2690
Epoch 120/400

Results:

[Plot: results]

Next, I would like to improve my understanding of CNNs and VGG-style networks for image recognition, and explore even more advanced uses of neural networks.