
Using deep learning to take on the COVID-19 virus

Disclaimer: I am not a medical, radiology, or epidemiology professional. This article was an experiment from an engineering and data scientist perspective, and should be regarded as such.

To apply deep learning for COVID-19, you need a good data set, one with lots of samples, edge cases, metadata, and different images. You want your model to generalize to the data so that it can make accurate predictions on new, unseen data. Unfortunately, not much data is available. There are posts on sites such as LinkedIn and Medium that claim >90%, or in some cases 100%, accuracy on detecting COVID-19 cases. However, these posts usually contain mistakes, which are not always acknowledged. This article tries to mitigate some of the usual mistakes by exploring solutions to the following problems:

  • Testing your model’s performance with the same data you used to train it. The first rule of machine learning is to never do this: measuring accuracy on the training data set rather than on a held-out testing data set does not give an accurate picture of how well the model generalizes to new, unseen data. You might not believe this is an issue, but it is a typical fallacy for people new to machine learning and data science.

  • Not using computer vision techniques to achieve better generalization. Augmentation is an absolute necessity, especially in the case where there are very few samples for the model to learn from.

  • Not thinking about what the model learns. Will your model actually learn the pattern of what COVID-19 looks like on an X-ray image, or is it likely to latch onto some other, spurious pattern in your data set instead? One amusing story from early machine learning, called Detecting Tanks, illustrates this: photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days, so the model learned to recognize the weather rather than the tanks.

  • Not using the correct metrics. If you use the accuracy metric on a heavily imbalanced data set, then your model might look like it’s performing well in the general case, even though it’s performing poorly on COVID-19 cases. 95% accuracy might not be that impressive if your accuracy on COVID-19 cases is 30%.
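To see that last point on a toy example (hypothetical numbers, using scikit-learn's metric functions):

```python
from sklearn.metrics import accuracy_score, f1_score

# hypothetical imbalanced test set: 90 normal cases (0), 10 COVID-19 cases (1)
y_true = [0] * 90 + [1] * 10
# a model that predicts "normal" almost every time
y_pred = [0] * 90 + [0] * 7 + [1] * 3

print(accuracy_score(y_true, y_pred))         # 0.93 -- looks impressive
print(f1_score(y_true, y_pred, pos_label=1))  # ~0.46 -- reveals the problem
```

Accuracy rewards the model for getting the majority class right, while the F1 score on the COVID-19 class exposes that 7 of the 10 positive cases were missed.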

All of the work for this article is available on GitHub.

Acquiring data

The following list shows the different data sources that I accumulated while the coronavirus was spreading. Currently, this is the best data available on the internet because the data being collected in many countries and hospitals is classified information. Even if it were not classified, you would still need consent from each patient.

  • GitHub: COVID-19 Chest X-Ray and CT images. This data set is continually updated as more cases appear.
  • Kaggle 1: Chest X-Ray images with both positive and negative cases of pneumonia.
  • Kaggle 2: Chest X-Ray images with positive and negative cases of pneumonia.

Problem with mixed data sources

All of the data available on the internet has not been subjected to the same preprocessing, and the images clearly differ, as explained below, in the amount of black bars. In most of the positive COVID-19 data, the X-ray takes up almost the entire image, with little to no black bars on the sides (with a few exceptions). However, in the data set with negative COVID-19 cases, most images have black bars on the sides.

Images comparison

This becomes a problem because the model might later learn that it just needs to look at the black bars on the side to know whether a new sample is a positive or negative case of COVID-19.

After manual inspection of the data set, it becomes apparent that almost all of the negative cases have these black bars, while approximately 10-20% of the positive cases have black bars. Wouldn’t it be quite a coincidence if the accuracy of predicting positive or negative later turns out to be around the 80-90% mark?

Images markers cropped

In the previous image, the left side shows an image with a negative COVID-19 case (black side and top bars). The image on the right shows the same image, but is cropped toward the center and scaled up to match sizes.

Preprocessing – improving the data

The data problem from earlier is clearly one we need to solve. Can we somehow detect when there are black top, black bottom, or black side bars? And, can we somehow automatically correct this?

The following steps are the key to solving this:

  1. Use a simple metric: Are there x or more black pixels in a given row or column?
  2. If yes, mark that row or column for removal. Repeat for all rows and columns.
  3. Done checking for black pixels? Then crop the image.
  4. Move on to the next image and repeat.
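The steps above can be sketched with NumPy (a toy illustration, not the exact script used later; it assumes the brightness threshold of 40 and the "more than half black" rule that the article describes for the real implementation):

```python
import numpy as np

def crop_black_bars(img, threshold=40):
    """Toy sketch: drop rows and columns that are mostly black."""
    # average over the color channel -> 2-D brightness map
    gray = img.mean(axis=2)
    dark = gray < threshold
    # a row/column counts as "black" if more than half its pixels are dark
    rows_keep = dark.sum(axis=1) <= img.shape[1] / 2
    cols_keep = dark.sum(axis=0) <= img.shape[0] / 2
    return img[rows_keep][:, cols_keep]

# 6x6 "image" with an all-black band on the left two columns
img = np.full((6, 6, 3), 200, dtype=np.uint8)
img[:, :2] = 0
cropped = crop_black_bars(img)
print(cropped.shape)  # (6, 4, 3)
```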

OK, not that fast. Here’s a new problem: we need to keep the same aspect ratio, and we can’t simply upscale or downscale an image whose aspect ratio is off. This is because the existing architectures need an input size of 224×224, which is an aspect ratio of 1:1.

In other words, can we check whether we are removing as many columns as we are removing rows? Can we overcompensate by removing specific rows if there are too many columns with black pixels but little to no rows of black pixels?

Removing the black bars and fixing aspect ratio

The following image demonstrates the input and output of removing the black bars, with one example where it was particularly successful. This is quite powerful, but we are still not adhering to the aspect ratio of 1×1.

Images black bars removed

The most effective and easiest solution that I found to fixing the aspect ratio was calculating the difference between the number of rows removed versus the number of columns removed. Divide the difference by two, and remove rows from top and bottom, or remove columns from left and right.
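The difference-halving idea can be sketched on its own (a toy version that works from the current image shape rather than from the removed-row/column counts the real script uses):

```python
import numpy as np

def square_crop(img):
    """Trim the longer dimension evenly from both ends until square."""
    h, w = img.shape[:2]
    cut = abs(h - w) // 2
    if h > w:
        # remove half the difference from the top, half from the bottom
        img = img[cut:h - cut]
        if img.shape[0] != img.shape[1]:  # odd leftover row
            img = img[:-1]
    elif w > h:
        # remove half the difference from the left, half from the right
        img = img[:, cut:w - cut]
        if img.shape[0] != img.shape[1]:  # odd leftover column
            img = img[:, :-1]
    return img

print(square_crop(np.zeros((230, 224, 3), dtype=np.uint8)).shape)  # (224, 224, 3)
```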

Images adjusted aspect ratio

The last image is saved as a 224×224 image because we can scale it down now that the aspect ratio is 1×1. This saves space and some time when loading the image again in another script.

Running the script

It takes approximately 14 seconds per image to find the rows and columns for removal. I use a naive approach, which could probably be done more efficiently; nevertheless, it works for this experiment. I had 8851 images in the Kaggle 1 data set, which turned out to be roughly 39 hours of preprocessing. All of the data is released and available in the GitHub repository for this article.

I ran the code on both the Kaggle 1 and GitHub data sets, and the result is that I have all of these images saved in a 224×224 image format. The following code is only for running it on the Kaggle 1 data set, but you can substitute the first part of the code to load any data set of images.

import pandas as pd
import matplotlib.pyplot as plt
import cv2, time, os
import numpy as np

df = pd.read_csv('kaggle-pneumonia-jpg/stage_2_detailed_class_info.csv')
image_folder = 'kaggle-pneumonia-jpg/stage_2_train_images_jpg/'

df = df.loc[df['class'] == 'Normal']

The following code shows how I loaded the COVID-19 data set. Note that the rest of the code is the same. This is just loading a different data set.

columns = ['finding', 'modality', 'filename', 'view']

# only use COVID-19 samples done with X-ray and PA view
df = pd.read_csv('metadata.csv', usecols=columns)
df = df.loc[df['finding'] == 'COVID-19']
df = df.loc[df['modality'] == 'X-ray']
df = df.loc[df['view'] == 'PA']

After loading the dataframe, I load the images, labels, and file names into arrays.

images = []
labels = []
filenames = []

for index, row in df.iterrows():
    image = cv2.imread(image_folder + row.patientId + '.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    images.append(image)

    label = row['class']
    labels.append(label)

    filename = row.patientId
    filenames.append(filename)

images = np.array(images)

Next, I have a get_black_pixel_indices() function that loops through the rows and columns of pixels. You can imagine an image as a table of pixel values with rows and columns.

I count the number of black pixels, defined as pixels whose average value across the color channels is less than 40. A pixel consists of one value for each color channel: red, green, and blue (RGB). For example, a pixel [39, 67, 45] has an average value of 50.33, so it would not count as black.
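As a quick check of that threshold rule (a toy snippet, not part of the preprocessing script):

```python
import numpy as np

pixel = np.array([39, 67, 45])
avg = np.average(pixel)

print(round(avg, 2))  # 50.33
print(avg < 40)       # False: this pixel does not count as black
```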

When I am finished counting the number of black pixels for a given row and column, I check to see whether there are more black pixels than non-black pixels. If that is the case, I choose to remove that row or column.

def get_black_pixel_indices(img):
    rows_remove = []
    cols_remove = []

    for counter in range(img.shape[0]):
        row = img[counter]
        # shift column dim to first dim, row dim to second dim
        # acts like a 90 degree counterclockwise rotation of the image
        col = img.transpose(1, 0, 2)[counter]

        row_black_pixel_count = 0
        col_black_pixel_count = 0
        for selector in range(img.shape[1]):
            row_pixel = row[selector]
            col_pixel = col[selector]

            if np.average(row_pixel) < 40:
                row_black_pixel_count += 1
            if np.average(col_pixel) < 40:
                col_black_pixel_count += 1

        # mark the row/column for removal if more than half its pixels are black
        if row_black_pixel_count > len(row)/2:
            rows_remove.append(counter)
        if col_black_pixel_count > len(col)/2:
            cols_remove.append(counter)

    return rows_remove, cols_remove

Before I remove the rows and columns with more black pixels than non-black pixels, I must filter out the rows and columns that would obstruct the image if removed. The simplest rule that I found was that no rows or columns should be removed between index 200 and 800. After filtering out these obstructing indices, I remove the remaining rows and columns.

def remove_obstructing_indices(rows_remove, cols_remove):
    # filter instead of deleting while iterating, which would skip elements
    rows_remove = [v for v in rows_remove if not 200 <= v <= 800]
    cols_remove = [v for v in cols_remove if not 200 <= v <= 800]

    return rows_remove, cols_remove

def remove_black_pixels(img, rows_remove, cols_remove):
    img = np.delete(img, rows_remove, axis=0)
    img = np.delete(img, cols_remove, axis=1)

    return img

The last step of preprocessing is adjusting the aspect ratio. I know that in most cases a human torso is taller than it is wide, so I chose to cut the image from the top and bottom. There are also cases where the torso is wider, but those were rare when inspecting the data set.

I start by calculating the column-to-row difference and row-to-column difference, and based on that, I find how many rows or columns that I need to remove to fix the aspect ratio. I found that I can divide the difference by two, then remove half of the difference from either the top and bottom, or left and right. After that, I might have one leftover row or column if the difference is an odd number, so I compensate by either removing a row from the bottom or a column from the right.

def adjust_aspect_ratio(img, rows_remove, cols_remove):
    col_row_diff = len(cols_remove) - len(rows_remove)
    row_col_diff = len(rows_remove) - len(cols_remove)

    if col_row_diff > 0:
        # more columns than rows were removed:
        # trim rows evenly from top and bottom
        slice_size = int(col_row_diff/2)
        img = img[slice_size:img.shape[0]-slice_size]

        # odd difference: remove one leftover row from the bottom
        if img.shape[0] != img.shape[1]:
            img = img[:-1]

    elif row_col_diff > 0:
        # more rows than columns were removed:
        # trim columns evenly from left and right
        slice_size = int(row_col_diff/2)
        img = img[:, slice_size:img.shape[1]-slice_size, :]

        # odd difference: remove one leftover column from the right
        if img.shape[0] != img.shape[1]:
            img = img[:, :-1, :]

    if img.shape[0] == img.shape[1]:
        return img, True
    return img, False

Finally, I run the loop that calls the four functions previously described. This results in images with the black bars removed from the left, right, top, and bottom. I downscale each image to 224×224, save it, and record the label and file name in a .csv file that accumulates the paths and file names of all the preprocessed images.

save_folder = 'normal_dataset/'

start = time.time()
for i, img in enumerate(images):
    rows_remove, cols_remove = get_black_pixel_indices(img)
    rows_remove, cols_remove = remove_obstructing_indices(rows_remove, cols_remove)

    new_img = remove_black_pixels(img, rows_remove, cols_remove)
    adj_img, shape_match = adjust_aspect_ratio(new_img, rows_remove, cols_remove)

    if shape_match:
        resized_image = cv2.resize(adj_img, (224, 224))

        label = labels[i]
        name = filenames[i] + '.jpg'

        cv2.imwrite(save_folder + name, resized_image)
        new_df = pd.DataFrame({'filename': [name], 'finding': [label]})

        # write the header on the first row, then append without it
        if i == 0:
            new_df.to_csv('normal_xray_dataset.csv')
        else:
            new_df.to_csv('normal_xray_dataset.csv', mode='a', header=False)

    print('image number {0}, time spent {1:.2f}s'.format(i+1, time.time() - start))

From a final overview of another 40 images of negative and positive cases, I can clearly see that the preprocessing approach helped.

Image comparison 2

Augmentation of COVID-19 samples

To make the most of what I have (99 images), I need to augment the training data. Augmentation is a manipulation of an image that makes it look new to the CNN. Some common examples of augmentation are rotations, shears, and flips of images.

Keep in mind that I use augmentation to generalize the model better and prevent overfitting. Essentially, each time I use one form of augmentation, I add another similar, but slightly different, image to the training data set. Note that this is also a demonstration of how you could augment the images, and not the augmentation we will use later.

Adjusted contrast

I use TensorFlow’s image.adjust_contrast() function to adjust the contrast of all the images.

X_train_contrast = []

for x in X_train:
    contrast = tf.image.adjust_contrast(x, 2)
    X_train_contrast.append(contrast)

plot_images(X_train_contrast, 'Adjusted Contrast')

This results in the following updated images.

Image adjusted contrast

Adjusted saturation

I use TensorFlow’s image.adjust_saturation() function to adjust the saturation of all the images.

X_train_saturation = []

for x in X_train:
    saturation = tf.image.adjust_saturation(x, 3)
    X_train_saturation.append(saturation)

plot_images(X_train_saturation, 'Adjusted Saturation')

This results in the following updated images.

Image adjusted saturation

Flipping left right

I use TensorFlow’s image.flip_left_right() function to flip all of the images from left to right.

X_train_flipped = []

for x in X_train:
    flipped = tf.image.flip_left_right(x)
    X_train_flipped.append(flipped)

plot_images(X_train_flipped, 'Flipped Left Right')

This results in the following updated images.

Image flipped left, right

Flipping up down

I use TensorFlow’s image.flip_up_down function to flip all of the images from top to the bottom.

X_train_flipped_up_down = []

for x in X_train:
    flipped = tf.image.flip_up_down(x)
    X_train_flipped_up_down.append(flipped)

plot_images(X_train_flipped_up_down, 'Flipped Up Down')

This results in the following updated images.


Flipping up down left right

I use TensorFlow’s image.flip_left_right() function to flip all of the images that are already flipped from top to bottom, which results in the images being flipped from top to bottom, then from left to right.

X_train_flipped_up_down_left_right = []

for x in X_train_flipped_up_down:
    flipped = tf.image.flip_left_right(x)
    X_train_flipped_up_down_left_right.append(flipped)

plot_images(X_train_flipped_up_down_left_right, 'Flipped Up Down Left Right')

This results in the following updated images.



TensorFlow has an extension called TensorFlow Addons that can be installed with the pip install tensorflow-addons command. The package has an image.rotate() function that lets you rotate any image by an arbitrary number of degrees. The function takes radians as input, so I use the radians() function from Python’s math module.

import tensorflow_addons as tfa
from math import radians

X_train_rot_45_deg = []
X_train_rot_135_deg = []
X_train_rot_225_deg = []
X_train_rot_315_deg = []

for x in X_train:
    deg_45 = tfa.image.rotate(x, radians(45))
    deg_135 = tfa.image.rotate(x, radians(135))
    deg_225 = tfa.image.rotate(x, radians(225))
    deg_315 = tfa.image.rotate(x, radians(315))

    X_train_rot_45_deg.append(deg_45)
    X_train_rot_135_deg.append(deg_135)
    X_train_rot_225_deg.append(deg_225)
    X_train_rot_315_deg.append(deg_315)

plot_images(X_train_rot_45_deg, 'Rotated 45 Degrees')
plot_images(X_train_rot_135_deg, 'Rotated 135 Degrees')
plot_images(X_train_rot_225_deg, 'Rotated 225 Degrees')
plot_images(X_train_rot_315_deg, 'Rotated 315 Degrees')

This results in the following updated images.

Comparison, part 3

Architecture overview

VGG and ResNet models are particularly popular at the moment because they perform the best, according to some of the papers released on COVID-19:

  1. Hemdan, E., Shouman M., & Karar M. (2020). COVIDX-Net: A Framework of Deep Learning Classifiers to Diagnose COVID-19 in X-Ray Images. arXiv:2003.11055.
  2. Wang, L., & Wong, A. (2020). COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images. arXiv:2003.09871.

I made an overview of two architectures, VGG19 and ResNet50, using an adapted style guide from this article. The yellow R stands for ReLU, while the yellow S stands for Softmax. Brackets mean repeated blocks of layers. Note that I abstracted batch normalization away.

ResNet50 image

Running a deep learning model

I am going to use transfer learning with the VGG19 model, initializing the weights to those from ImageNet, even though ImageNet is a very different data set from mine: it consists of images of everyday objects, while my data set contains only X-ray images. What I am transferring are the hidden layers that learned to detect patterns, such as edges and geometrical forms, in another data set.

I ran a short experiment where I tried not using transfer learning, and it resulted in an accuracy and F1 score of approximately 0.5-0.6. I chose to go ahead with the approach that performed better, as you will see in a moment. If you want to run that experiment yourself, define the VGG19 model with weights=None.

Imports, loading data, and splitting data

For loading the data and modeling, I use various packages: pandas, Matplotlib’s PyPlot, NumPy, cv2 (OpenCV), TensorFlow, and scikit-learn.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import cv2, time
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense

I start by loading both of the data sets (note that dataset/ in the path variables is the name of the folder of the GitHub repository). I put a constraint on the number of normal cases I use for modeling because using all of the images or even 20% of the normal images leads to overfitting to those samples.

covid_path = 'dataset/covid_dataset.csv'
covid_image_path = 'dataset/covid_adjusted/'

normal_path = 'dataset/normal_xray_dataset.csv'
normal_image_path = 'dataset/normal_dataset/'

covid_df = pd.read_csv(covid_path, usecols=['filename', 'finding'])
normal_df = pd.read_csv(normal_path, usecols=['filename', 'finding'])

normal_df = normal_df.head(99)


The following code loads the images into Python lists, converts them to NumPy arrays, and normalizes them so that all of the pixel values are between zero and one.

covid_images = []
covid_labels = []

for index, row in covid_df.iterrows():
    filename = row['filename']
    label = row['finding']
    path = covid_image_path + filename

    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    covid_images.append(image)
    covid_labels.append(label)

normal_images = []
normal_labels = []

for index, row in normal_df.iterrows():
    filename = row['filename']
    label = row['finding']
    path = normal_image_path + filename

    image = cv2.imread(path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    normal_images.append(image)
    normal_labels.append(label)

# normalize to interval of [0,1]
covid_images = np.array(covid_images) / 255

# normalize to interval of [0,1]
normal_images = np.array(normal_images) / 255

I want to make sure that both the normal and infected cases are split correctly, and because I loaded them separately, I also split them separately. If I later choose to load more normal cases than infected cases, thus loading an imbalanced data set, I will still get a good split in terms of the number of infected cases in the training and testing data sets.

After splitting the data set, I concatenate them into input X and output y, along with a label for training and testing data. I also convert the labels COVID-19 and Normal into numbers by using the LabelBinarizer() and to_categorical() functions.
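As a toy check of what these two functions produce (using a NumPy identity-matrix trick in place of to_categorical so the snippet doesn't need TensorFlow):

```python
from sklearn.preprocessing import LabelBinarizer
import numpy as np

labels = ['COVID-19', 'Normal', 'Normal', 'COVID-19']

# with two classes, LabelBinarizer yields a single 0/1 column
# (alphabetically, 'COVID-19' becomes 0 and 'Normal' becomes 1)
binary = LabelBinarizer().fit_transform(labels)
print(binary.ravel())  # [0 1 1 0]

# to_categorical then one-hot encodes that column; with NumPy it is
# equivalent to indexing an identity matrix
onehot = np.eye(2)[binary.ravel()]
print(onehot[0])       # [1. 0.]  -> COVID-19
```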

# split into training and testing
covid_x_train, covid_x_test, covid_y_train, covid_y_test = train_test_split(
    covid_images, covid_labels, test_size=0.2)

normal_x_train, normal_x_test, normal_y_train, normal_y_test = train_test_split(
    normal_images, normal_labels, test_size=0.2)

X_train = np.concatenate((normal_x_train, covid_x_train), axis=0)
X_test = np.concatenate((normal_x_test, covid_x_test), axis=0)
y_train = np.concatenate((normal_y_train, covid_y_train), axis=0)
y_test = np.concatenate((normal_y_test, covid_y_test), axis=0)

# make labels into categories - either 0 or 1
y_train = LabelBinarizer().fit_transform(y_train)
y_train = to_categorical(y_train)

y_test = LabelBinarizer().fit_transform(y_test)
y_test = to_categorical(y_test)

Defining the model

I used a pretrained VGG19 model. I set include_top=False, which means that there are no fully connected layers at the end of the VGG19 architecture. This lets me take the outputs from the convolutions, flatten them, and parse them into my own output layer.

I also added regularization by using a Dropout layer because I was strongly overfitting to the training data. I am still overfitting to the training data, but not as much. I went from approximately a 0.97 accuracy to a 0.88 accuracy.

vggModel = VGG19(weights="imagenet", include_top=False,
    input_tensor=Input(shape=(224, 224, 3)))

outputs = vggModel.output
outputs = Flatten(name="flatten")(outputs)
outputs = Dropout(0.5)(outputs)
outputs = Dense(2, activation="softmax")(outputs)

model = Model(inputs=vggModel.input, outputs=outputs)

# freeze the pretrained convolutional layers; only the new head is trained
for layer in vggModel.layers:
    layer.trainable = False

# compile before training (loss/optimizer choices shown here are typical
# for a two-class softmax output)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

Defining the augmentation of the training data

The way I chose to augment the data was by generating batches of augmentations on each sample. For each sample, I get a new batch of samples from the ImageDataGenerator I defined in the following code. All of these new augmented samples are used for training, instead of the original sample.

train_aug = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
I chose to set rotation_range=20, which is the degree range for random rotations of the original image. This means that Keras randomly chooses an angle of up to 20 degrees and rotates the image by that angle, either clockwise or counterclockwise.

Meanwhile, I also have both the width_shift_range=0.2 and height_shift_range=0.2, which randomly shifts the image by the width (that is, to the left and right) and height (that is, up and down) as specified.

Finally, I have horizontal_flip=True, which at random flips the image horizontally (that is, flipping the image from left to right) as you saw earlier.

A good visualization of the ImageDataGenerator function can be found in this blog.

Fitting, predicting, and scoring

I have defined the model, the augmentation, and the data. Now all I must do is start training the VGG19 model. I can train the model on the augmented data by feeding it the batches produced by the train_aug.flow() function.

history = model.fit(train_aug.flow(X_train, y_train, batch_size=32),
                    steps_per_epoch=len(X_train) / 32,
                    epochs=500)  # the epoch count is a tunable choice

Because I want to see how well the model performs on the different classes, I use the correct metrics to measure its performance (namely, F1, and not accuracy). David Ernst explains why I could use both AUC and F1:

“The ROC AUC [or F1] is sensitive to class imbalance in the sense that when there is a minority class, you typically define this as the positive class and it will have a strong impact on the AUC [or F1] value. This is very much desirable behaviour. Accuracy is for example not sensitive in that way. It can be very high even if the minority class is not well predicted at all.”

After training the model, I can use it to make predictions on new and unseen data. I use argmax for finding the highest predicted probability, and then I can generate a classification report using scikit-learn.

from sklearn.metrics import classification_report

y_pred = model.predict(X_test, batch_size=32)
print(classification_report(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)))

The LabelBinarizer from earlier made it such that 0 is the COVID-19 class and 1 is the normal class. I can see that the accuracy is 82%, while the F1 score for COVID-19 is 0.84. This is good considering the limited data, though it perhaps suggests that there is still noise in the images that I did not filter out or correct for.

              precision    recall  f1-score   support

           0       0.75      0.95      0.84        19
           1       0.93      0.70      0.80        20

    accuracy                           0.82        39
   macro avg       0.84      0.82      0.82        39
weighted avg       0.84      0.82      0.82        39

Now, I can plot the training accuracy and the training loss using PyPlot.



plt.plot(history.history['accuracy'])  # the key may be 'acc' in older Keras versions
plt.plot(history.history['loss'])

plt.title('Model Accuracy & Loss')
plt.legend(['Accuracy', 'Loss'])
plt.show()


This generates a simple plot that shows how the model’s accuracy increased over time on the training data set as well as the loss on the training data set.

Model ACC loss


Although my approach attempted to mitigate some of the mistakes made in other approaches, I still believe that I can improve it a lot. As you saw from the f1-score, I most likely still have some issues with the data that I can’t immediately see.

The following list gives some of the approaches that I could use to get a better model and more reliable results.

  1. Using X-ray images of SARS, MERS, ARDS, and other types of infections instead of just the normal category. This would enable me to distinguish between different infections.
  2. More quality data of COVID-19 cases from the posteroanterior (PA) view.
  3. Data in different stages of COVID-19: infected, no symptoms; infected, mild symptoms; infected, heavy symptoms.
  4. More advanced types of imaging data, that is, CT images that include more information.
  5. Cross-validation or nested cross-validation for better estimation of the true generalization error of the model.

The results achieved in this article are not to be interpreted as usable in real-world scenarios. You should only consider a deep learning COVID-19 approach after rigorous testing by engineers, doctors, and other professionals, followed by a clinical study.