As part of my own learning, and continuing from Part 1, we will try to improve our neural network model using some of the well-known machine learning techniques described in the TensorFlow documentation.

In the previous article, we saw certain problems with our training. Here, we will address them and see if our results improve as we go.

## Problems observed in the previous solution

### Overfitting

A model is said to overfit when it achieves high accuracy on the training data (the data used to fit the model) but performs poorly when evaluated against a test or otherwise unseen data set. Instead of learning general patterns, the model has memorized details specific to the training examples.

Training accuracy that is noticeably higher than testing accuracy is a clear indicator of this phenomenon. Thankfully, there are several techniques available to counter it.

#### Model size

First, look at the size of the model, meaning the number of layers and units. If the model is far bigger than the problem at hand requires, it is more likely to learn features and patterns not relevant to the problem and thus overfit the training data. An oversized model will not generalize well, while an undersized model will underfit the data.

Taking the model from our previous article as a baseline, we will evaluate how reducing and increasing the size affects the performance of the model. The following models were tried and compared:

```
baseline_model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(128, activation=tf.nn.sigmoid),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
bigger_model2 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(1024, activation=tf.nn.relu),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
bigger_model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
smaller_model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
```

To determine the ideal model, we plot the loss function of the validation data against the number of epochs.

*Comparison of smaller, bigger, and baseline models.*

*Comparison of bigger, bigger2, and baseline models.*

In these plots, we see that validation loss (`sparse_categorical_crossentropy`) is almost identical for the `bigger` and `bigger2` models, and better than for the `smaller` and `baseline` models. So we go ahead and select these models over our `baseline` model for further tuning.
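As a sketch of how such a comparison plot can be produced (assuming `histories` is a dictionary mapping each model's name to the `History` object returned by its `model.fit` call; the function name `plot_val_loss` and the output filename are my own):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def plot_val_loss(histories, outfile="val_loss.png"):
    # histories: {model_name: History returned by model.fit(..., validation_data=...)}
    plt.figure()
    for name, history in histories.items():
        plt.plot(history.history["val_sparse_categorical_crossentropy"],
                 label=name)
    plt.xlabel("Epochs")
    plt.ylabel("Validation loss (sparse_categorical_crossentropy)")
    plt.legend()
    plt.savefig(outfile)
```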

#### Number of epochs

The number of epochs plays an important role in avoiding overfitting and in overall model performance. In the comparison graphs plotted in the section above, we observe that the loss function for the validation data reaches a minimum and then, on further training, increases again, while the loss function for the training data keeps decreasing. This is exactly what overfitting means: the model learns patterns specific to the training set and does not generalize well, so it does better on training data than on validation data. We have to stop before the model overfits the data, so in the above case an `epoch` value of `40` is ideal.
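Picking that stopping point can be automated. A minimal sketch that returns the epoch with the lowest recorded validation loss (the helper name `best_epoch` is my own):

```python
def best_epoch(val_losses):
    # Return the 1-based epoch at which validation loss was lowest;
    # training beyond this point is where overfitting sets in.
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

# Example: validation loss falls, bottoms out, then rises again.
print(best_epoch([0.90, 0.55, 0.40, 0.45, 0.60]))  # → 3
```

In practice, Keras can do this during training via the `keras.callbacks.EarlyStopping` callback, e.g. monitoring `val_loss` with a patience of a few epochs.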

#### L1 and L2 regularization

L1 and L2 regularization penalize large weights by adding a term to the loss function: L1 adds a penalty proportional to the absolute value of each weight, while L2 adds a penalty proportional to its square, pushing the weights toward small values. The plot below shows the effect of applying L2 regularization to our model.
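To make the penalty terms concrete, here is a small numpy sketch of what the two schemes add to the loss (the function names are my own):

```python
import numpy as np

def l1_penalty(weights, lam=0.001):
    # L1: lam * sum(|w|) over all weight matrices -- drives weights toward zero.
    return lam * sum(np.sum(np.abs(w)) for w in weights)

def l2_penalty(weights, lam=0.001):
    # L2: lam * sum(w^2) -- keeps weights small but rarely exactly zero.
    return lam * sum(np.sum(w ** 2) for w in weights)
```

In Keras, the equivalent is passing `kernel_regularizer=keras.regularizers.l2(0.001)` (or `keras.regularizers.l1`) to a `Dense` layer.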

#### Using Dropout

The Keras library provides a dropout layer, implementing a concept introduced in *Dropout: A Simple Way to Prevent Neural Networks from Overfitting* (JMLR 2014). Consequences of adding a dropout layer are that training time increases and, if the dropout rate is too high, the model may underfit.

Models after applying the dropout layers:

```
bigger_model1 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
bigger_model2 = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(1024, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(512, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
```

The following image shows the effect of applying dropout regularization.

During one run, the bigger model did not converge at all, even after 250 epochs. This is one of the side effects of applying dropout regularization.
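For intuition, the mechanism itself is simple to sketch in numpy: during training, each unit is zeroed with probability `rate`, and the survivors are rescaled so the expected activation is unchanged (this is the "inverted dropout" formulation; the function name is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate):
    # Zero out a fraction `rate` of units; rescale survivors by 1/(1-rate)
    # so the expected value of each activation stays the same.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

out = dropout(np.ones(10000), 0.5)
# Roughly half the units are zeroed, and the mean stays close to 1.
```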

### Lack of training data

With only 26 or so training examples, we have done reasonably well. But for image processing, there are several data augmentation techniques that generate more data by applying some distortion to the original images. For example, for every input image, we can add an inverted-color copy to our data set. To achieve this, the `load_image_dataset` function from Part 1 is modified as follows (it is also possible to add a randomly rotated image for each original image):

```
# If invert_image is True, also store an inverted-color version of each
# image in the training set.
def load_image_dataset(path_dir, maxsize, reshape_size, invert_image=False):
    images = []
    labels = []
    os.chdir(path_dir)
    for file in glob.glob("*.jpg"):
        img = jpeg_to_8_bit_greyscale(file, maxsize)
        inv_image = 255 - img  # Generate an inverted-color copy of the original.
        if re.match('chihuahua.*', file):
            images.append(img.reshape(reshape_size))
            labels.append(0)
            if invert_image:
                images.append(inv_image.reshape(reshape_size))
                labels.append(0)
        elif re.match('muffin.*', file):
            images.append(img.reshape(reshape_size))
            labels.append(1)
            if invert_image:
                images.append(inv_image.reshape(reshape_size))
                labels.append(1)
    return (np.asarray(images), np.asarray(labels))
```
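The randomly rotated variant mentioned above could be sketched, for example, with quarter-turn rotations via `np.rot90` (this assumes square images, as ours are after resizing; the helper name is my own, and arbitrary angles would need something like `scipy.ndimage.rotate`):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_rotated_copies(images, labels):
    # For every image, append a copy rotated by a random multiple of
    # 90 degrees, keeping its label.
    aug_images, aug_labels = list(images), list(labels)
    for img, label in zip(images, labels):
        k = int(rng.integers(1, 4))  # 1, 2, or 3 quarter turns
        aug_images.append(np.rot90(img, k))
        aug_labels.append(label)
    return aug_images, aug_labels
```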

The effects of adding inverted-color and randomly rotated images on training with dropout enabled are as follows. The size of the data set increased by 3x.

The result indicates that this has worsened the overfitting.

*Note: For data augmentation, Keras provides a built-in utility, `keras.preprocessing.image.ImageDataGenerator`, which will not be covered here.*

Another way to overcome the problem of minimal training data is to use a pretrained model and fine-tune it on our new training examples. This approach is called transfer learning. Since TensorFlow and Keras provide a good mechanism for saving and loading models, this can be achieved quite easily, but it is out of scope here.

## Conclusion

On further testing with different models and activation functions, the best results were observed by using sigmoid as the activation function and adding a dropout layer to our baseline model. Similar performance was observed with the relu activation function, but with sigmoid the curve was smoother. Reducing the image size to 50×50 also improved the training time without impacting the performance of the models.

Apart from the above, I also tested a VGG-style multilayer CNN model and multiple variations of CNN models, but the results with them were quite poor.

The following image shows the plot of the results from all three models.

Baseline model used:

```
baseline_model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(maxsize_w, maxsize_h, 1)),
    keras.layers.Dense(128, activation=tf.nn.sigmoid),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(16, activation=tf.nn.sigmoid),
    keras.layers.Dense(2, activation=tf.nn.softmax)
])
baseline_model.compile(optimizer=keras.optimizers.Adam(lr=0.001),
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy', 'sparse_categorical_crossentropy'])
```

Output:

```
- 0s - loss: 0.0217 - acc: 1.0000 - sparse_categorical_crossentropy: 0.0217 - val_loss: 0.2712 - val_acc: 0.9286 - val_sparse_categorical_crossentropy: 0.2712
Epoch 119/400
- 0s - loss: 0.0224 - acc: 1.0000 - sparse_categorical_crossentropy: 0.0224 - val_loss: 0.2690 - val_acc: 0.9286 - val_sparse_categorical_crossentropy: 0.2690
Epoch 120/400
```

Results:

Next, I would like to improve my understanding of CNNs and VGG-style networks for image recognition, and explore even more advanced uses of neural networks.