Digital Developer Conference: Hybrid Cloud 2021. On Sep 21, gain free hybrid cloud skills from experts and partners. Register now

Whole-slide image preprocessing in Python

The primary goal of the Tumor Proliferation Assessment Challenge 2016 (TUPAC16) was to develop algorithms to automatically predict breast cancer tumor proliferation scores. In this challenge, the training set consisted of 500 whole-slide images that are scored (1, 2, or 3) by pathologists based on mitosis counts. A higher proliferation score indicates a worse prognosis because higher tumor proliferation rates are correlated with worse outcomes. The tissue samples are stained with hematoxylin and eosin (H&E).

One of our first approaches to this challenge was to apply deep learning to breast cancer whole-slide images, following an approach similar to the process used by Ertosun and Rubin in Automated Grading of Gliomas using Deep Learning in Digital Pathology Images: A modular approach with ensemble of convolutional neural networks. One important part of the technique described by Ertosun and Rubin involves image preprocessing, where large whole-slide images are divided into tiles and only tiles that consist of at least 90% tissue are further analyzed. Tissue is determined by hysteresis thresholding on the grayscale image complement.

The three TUPAC16 challenge tasks were won by Paeng et al, described in A Unified Framework for Tumor Proliferation Score Prediction in Breast Histopathology. In their technique, identification of tissue regions in whole-slide images is done using Otsu thresholding, morphological operations, and binary dilation.

Tissue identification in whole-slide images can be an important precursor to deep learning. Deep learning is computationally expensive and medical whole-slide images are enormous. Typically, a large portion of a slide isn’t useful, such as the background, shadows, water, smudges, and pen marks. You can use preprocessing to rapidly reduce the quantity and increase the quality of the image data to be analyzed. This can lead to faster, more accurate model training.

In this series, we’ll look at whole-slide image processing and will describe various filters that can be used to increase the accuracy of tissue identification. After determining a useful set of filters for tissue segmentation, we’ll divide slides into tiles and determine sets of tiles that typically represent good tissue samples.

The solution should demonstrate high performance, flexibility, and accuracy. Filters should be easy to combine, chain, and modify. Tile scoring should be easy to modify for accurate tile selection. The solution should offer the ability to view filter, tile, and score results across large, unique datasets. The solution should also have the ability to work in a batch mode, where all of the image files and intermediary files are written to the file system, and in a dynamic mode, where high-scoring tissue tiles can be retrieved from the original WSI files without requiring any intermediary files.

In summary, we will scale down whole-slide images (WSI), apply filters to these scaled-down images for tissue segmentation, break the slides into tiles, score the tiles, and then retrieve the top tiles based on their scores.

Five steps
5 Steps


This project heavily uses Python3. Python is an ideal language for image processing. OpenSlide is used for reading WSI files. Pillow is used for basic image manipulation in Python. NumPy is used for fast, concise, powerful processing of images as NumPy arrays. Scikit-image is heavily used for a wide variety of image functionality, such as morphology, thresholding, and edge detection.

The following are some quick set-up steps on a MacOS system.

  1. Install a package manager like Homebrew.

     /usr/bin/ruby -e "$(curl -fsSL"
  2. Install Python3.

     brew install python3
  3. Install OpenSlide. Note that OpenSlide is licensed under the LGPL 2.1 License.

     brew install openslide

Next, we’ll install various useful Python packages using the pip3 package manager. These packages include:

    pip3 install -U matplotlib numpy openslide-python Pillow scikit-image scikit-learn scipy

We use scikit-image filters (hysteresis thresholding) in this tutorial that are not present in the latest released version of scikit-image at the time of this writing. We can install scikit-image from the source, as described in the README file.

    git clone
    cd scikit-image
    pip3 install -r requirements.txt
    pip3 install .

Whole-slide imaging background

A whole-slide image is a digital representation of a microscopic slide, typically at a very high level of magnification such as 20x or 40x. As a result of this high magnification, whole slide images are typically very large in size. The maximum file size for a single whole-slide image in our training data set is 3.4 GB, with an average over 1 GB.

WSI example slide
WSI Example Slide

A whole-slide image is created by a microscope that scans a slide and combines smaller images into a large image. Techniques include combining scanned square tiles into a whole-slide image and combining scanned strips into a resulting whole-slide image. Occasionally, the smaller constituent images can be visually discerned, as in the shaded area at the top of the following slide.

Combining smaller images into a whole-slide image
Combining Smaller Images into a Whole-Slide Image

A fairly unusual feature of whole-slide images is the very large image size. For our training dataset of 500 images, the width varied from 19,920 pixels to 198,220 pixels, with an average of 101,688 pixels. The height varied from 13,347 pixels to 256,256 pixels, with an average of 73,154 pixels. The image total pixel sizes varied from 369,356,640 to 35,621,634,048 pixels, with an average of 7,670,709,628 pixels. The 500 training images take up a total of 525 GB of storage space.

Training image sizes
Training Image Sizes

The following images shows a histogram distribution of the training image sizes in megapixels.

Distribution of images based on number of pixels
Distribution of Image Sizes

We can use the OpenSlide project to read a variety of whole-slide image formats, including the Aperio *.svs slide format of our training image set. This is a pyramidal, tiled format, where the massive slide is composed of a large number of constituent tiles.

To use the OpenSlide Python interface to view whole slide images, we can clone the OpenSlide Python interface from GitHub and use the included DeepZoom script.

    git clone
    cd openslide-python/examples/deepzoom
    python3 -Q 100 WSI_DIRECTORY

The script starts a web interface on port 5000 and displays the image files at the specified file system location (the WSI_DIRECTORY value in the previous code, which could be a location such as ~/git/python-wsi-preprocessing/data/). If image files exist in subdirectories, they will also be displayed in the list of available slides.

If this viewing application is installed on a server that also hosts the whole-slide image repository, this offers a convenient mechanism for you to view the slides without requiring local storage space.

OpenSlide available slides
OpenSlide Available Slides

The following image shows the initial view of one of the whole-slide images viewed in a web browser.

OpenSlide whole-slide image
OpenSlide Whole Slide Image

Using this web interface, the whole-slide image can be zoomed to the highest magnification, revealing fine details at the tile level. Zooming and scrolling operations make it relatively easy to visually peruse the whole-slide image.

OpenSlide whole-slide image zoomed
OpenSlide Whole Slide Image Zoomed

Scale down images

To develop a set of filters that can be applied to an entire set of large whole-slide images, two of the first issues that we are confronted with are the size of the data and the format of the data. As mentioned, for our training dataset, the average .svs file size is over 1 GB and we have a total of 500 images. Additionally, the .svs format is a fairly unusual format that typically can’t be visually displayed by default by common applications and operating systems. Therefore, we will develop some code to overcome these important issues. Using OpenSlide and Python, we’ll convert the training data set to smaller images in a common format, thus reformulating a big data problem as a small data problem. Before filtering at the entire slide level, we will shrink the width and height down by a factor of 32x, which means we can perform filtering on 1/1024 the image data. Converting 500 svs files to png files at 1/32 scale takes approximately 12 minutes on a typical MacBook Pro using the following code.

In the wsi/ file, we have many functions that can be used in relation to the original svs images. Of particular importance are the following functions:


The open_slide() function uses OpenSlide to read in a .svs file. The show_slide() function opens a WSI .svs file and displays a scaled-down version of the slide to the screen. The slide_info() function displays metadata associated with all of the .svs files. The slide_stats() function looks at all images and summarizes size information about the set of slides. It also generates a variety of charts for a visual representation of the slide statistics. The training_slide_to_image() function converts a single .svs slide to a smaller image in a more common format such as .jpg or .png. The singleprocess_training_slides_to_images() function converts all .svs slides to smaller images, and the multiprocess_training_slides_to_images() function uses multiple processes (1 process per core) to speed up the slide conversion process. For the last three functions, when an image is saved, a thumbnail image is also saved. By default, the thumbnail has a maximum height or width of 300 pixels and is a .jpg format.

One of the first actions you can take to become more familiar with the training data set is to look at the metadata associated with each image, which you can do with the slide_info() function. The following code shows a sample of this metadata for Slide #1:

Opening Slide #1: /Volumes/BigData/TUPAC/training_slides/TUPAC-TR-001.svs
Level count: 5
Level dimensions: ((130304, 247552), (32576, 61888), (8144, 15472), (2036, 3868), (1018, 1934))
Level downsamples: (1.0, 4.0, 16.0, 64.0, 128.0)
Dimensions: (130304, 247552)
Objective power: 40
Associated images:
  macro: <PIL.Image.Image image mode=RGBA size=497x1014 at 0x114B69F98>
  thumbnail: <PIL.Image.Image image mode=RGBA size=404x768 at 0x114B69FD0>
Format: aperio
  Property: aperio.AppMag, value: 40
  Property: aperio.MPP, value: 0.16437
  Property: openslide.comment, value: Aperio Image Library v11.0.37
130304x247552 (256x256) JPEG/RGB Q=40;Mirax Digital Slide|AppMag = 40|MPP = 0.16437
  Property: openslide.level-count, value: 5
  Property: openslide.level[0].downsample, value: 1
  Property: openslide.level[0].height, value: 247552
  Property: openslide.level[0].tile-height, value: 256
  Property: openslide.level[0].tile-width, value: 256
  Property: openslide.level[0].width, value: 130304
  Property: openslide.level[1].downsample, value: 4
  Property: openslide.level[1].height, value: 61888
  Property: openslide.level[1].tile-height, value: 256
  Property: openslide.level[1].tile-width, value: 256
  Property: openslide.level[1].width, value: 32576
  Property: openslide.level[2].downsample, value: 16
  Property: openslide.level[2].height, value: 15472
  Property: openslide.level[2].tile-height, value: 256
  Property: openslide.level[2].tile-width, value: 256
  Property: openslide.level[2].width, value: 8144
  Property: openslide.level[3].downsample, value: 64
  Property: openslide.level[3].height, value: 3868
  Property: openslide.level[3].tile-height, value: 256
  Property: openslide.level[3].tile-width, value: 256
  Property: openslide.level[3].width, value: 2036
  Property: openslide.level[4].downsample, value: 128
  Property: openslide.level[4].height, value: 1934
  Property: openslide.level[4].tile-height, value: 256
  Property: openslide.level[4].tile-width, value: 256
  Property: openslide.level[4].width, value: 1018
  Property: openslide.mpp-x, value: 0.16436999999999999
  Property: openslide.mpp-y, value: 0.16436999999999999
  Property: openslide.objective-power, value: 40
  Property: openslide.quickhash-1, value: 0e0631ade42ae3384aaa727ce2e36a8272fe67039c513e17dccfdd592f6040cb
  Property: openslide.vendor, value: aperio
  Property: tiff.ImageDescription, value: Aperio Image Library v11.0.37
130304x247552 (256x256) JPEG/RGB Q=40;Mirax Digital Slide|AppMag = 40|MPP = 0.16437
  Property: tiff.ResolutionUnit, value: inch

The most important metadata for our purposes is that the slide has a width of 130,304 pixels and a height of 247,552 pixels. Note that these values are displayed as width followed by height. For most of our image processing, we will be using NumPy arrays, where rows (height) are followed by columns (width).

If we visually look over the metadata associated with other images in the training dataset, we see that the slides are not consistent in their various properties such as the number of levels contained in the .svs files. The metadata implies that the data set comes from a variety of sources. The variability in the slides, especially regarding issues such as H&E staining and pen marks on the slides, needs to be considered during our filter development.

If we call the slide_stats() function, in addition to the charts, we obtain a table of pixel statistics, as shown below.

Training images statistics

Attribute Size Slide #
Max width 198,220 pixels 10
Max height 256,256 pixels 387
Max size 35,621,634,048 pixels 387
Min width 19,920 pixels 112
Min height 13,347 pixels 108
Min size 369,356,640 pixels 112
Avg width 101,688 pixels
Avg height 73,154 pixels
Avg size 7,670,709,629 pixels

The wsi/ file contains constants that can be used to control various image conversion settings. For example, the SCALE_FACTOR constant controls the factor by which the slides will be scaled down. Its default value is 32, meaning that the height and width will be scaled down by a factor of 32. This means that when we perform filtering, it will be performed on an image 1/1024 the size of the original high-resolution image. The DEST_TRAIN_EXT constant controls the output format. We will use the default format, .png.

Using macOS, the following conversion times using singleprocess_training_slides_to_images() and multiprocess_training_slides_to_images() on the 500 image training set were obtained:

Training image dataset conversion times

Format Processes Time
jpg single process 26m09s
jpg multi process 10m21s
png single process 42m59s
png multi process 11m58s

After calling multiprocess_training_slides_to_images() using the .png format, we have 500 scaled-down whole-slide images in lossless .png format that we will examine in greater detail in relation to our filters.

Image saving, displaying, and conversions

To load, save, and display images, we use the Python Pillow package. In particular, we use the Image module, which contains an Image class used to represent an image. The wsi/ file contains an open_image() function to open an image stored in the file system. The get_training_image_path() function takes a slide number and returns the path to the corresponding training image file, meaning the scaled-down .png file that we created by calling multiprocess_training_slides_to_images().

If we want to convert a single .svs WSI file to a scaled-down .png (without converting all .svs files), open that .png image file as a PIL Image, and display the image to the screen, we can do the following.

img_path = slide.get_training_image_path(4)
img = slide.open_image(img_path)

To mathematically manipulate the images, we use NumPy arrays. The wsi/ file contains a pil_to_np_rgb() function that converts a PIL Image to a 3-dimensional NumPy array in RGB format. The first dimension represents the number of rows, the second dimension represents the number of columns, and the third dimension represents the channel (red, green, and blue).

rgb = util.pil_to_np_rgb(img)

For convenience, the display_img() function can be used to display a NumPy array image. Text can be added to the displayed image, which can be very useful when visually comparing the results of multiple filters.

img_path = slide.get_training_image_path(2)
img = slide.open_image(img_path)
rgb = util.pil_to_np_rgb(img)
util.display_img(rgb, "RGB")
Display image with text
Display Image with Text

When performing operations on NumPy arrays, functions in the wsi/ file will often utilize the util.np_info() function to display information about the NumPy array and the amount of time required to perform the operation. For example, the previouis call to pil_to_np_rgb() internally calls np_info():

t = Time()
rgb = np.asarray(pil_img)
np_info(rgb, "RGB", t.elapsed())
return rgb

This call to np_info() generates console output like the following:

RGB                  | Time: 0:00:00.162484  Type: uint8   Shape: (1385, 1810, 3)

We see that the PIL-to-NumPy array conversion took 0.16s. The type of the NumPy array is uint8, which means that each pixel is represented by a red, green, and blue unsigned integer value from 0 to 255. The image has a height of 1385 pixels, a width of 1810 pixels, and three channels (representing red, green, and blue).

We can obtain additional information about NumPy arrays by setting the util.ADDITIONAL_NP_STATS constant to True. If we rerun the previous code with ADDITIONAL_NP_STATS = True, we see the following:

RGB                  | Time: 0:00:00.157696 Min:   2.00  Max: 255.00  Mean: 182.62  Binary: F  Type: uint8   Shape: (1385, 1810, 3)

The minimum value is 2, the maximum value is 255, the mean value is 182.62, and binary is false, meaning that the image is not a binary image. A binary image is an image that consists of only two values (True or False, 1.0 or 0.0, 255 or 0). Binary images are produced by actions such as thresholding.

When interacting with NumPy image processing code, the information provided by np_info() can be extremely useful. For example, some functions return boolean NumPy arrays, other functions return float NumPy arrays, and other functions might return uint8 NumPy arrays. Before performing actions on a NumPy array, it’s usually necessary to know the data type of the array and the nature of the data in that array. For performance reasons, normally ADDITIONAL_NP_STATS should be set to False.

The wsi/ file contains an np_to_pil() function that converts a NumPy array to a PIL image.

If we have a PIL image, saving the image to the file system can be accomplished by calling the image’s save() function.


This first article in the automatic identification of tissues from big whole-slide images series gave you an overview of what our challenge was and explained how to set up the developing environment and scale down the images. Part 2 of this series will explain applying filters and tissue segmentation.