Whole-slide image preprocessing in Python
Learn how one team developed algorithms to automatically identify tissues from big whole-slide images
The primary goal of the Tumor Proliferation Assessment Challenge 2016 (TUPAC16) was to develop algorithms to automatically predict breast cancer tumor proliferation scores. In this challenge, the training set consisted of 500 whole-slide images that are scored (1, 2, or 3) by pathologists based on mitosis counts. A higher proliferation score indicates a worse prognosis because higher tumor proliferation rates are correlated with worse outcomes. The tissue samples are stained with hematoxylin and eosin (H&E).
One of our first approaches to this challenge was to apply deep learning to breast cancer whole-slide images, following an approach similar to the process used by Ertosun and Rubin in Automated Grading of Gliomas using Deep Learning in Digital Pathology Images: A modular approach with ensemble of convolutional neural networks. One important part of the technique described by Ertosun and Rubin involves image preprocessing, where large whole-slide images are divided into tiles and only tiles that consist of at least 90% tissue are further analyzed. Tissue is determined by hysteresis thresholding on the grayscale image complement.
The three TUPAC16 challenge tasks were won by Paeng et al, described in A Unified Framework for Tumor Proliferation Score Prediction in Breast Histopathology. In their technique, identification of tissue regions in whole-slide images is done using Otsu thresholding, morphological operations, and binary dilation.
Tissue identification in whole-slide images can be an important precursor to deep learning. Deep learning is computationally expensive and medical whole-slide images are enormous. Typically, a large portion of a slide isn’t useful, such as the background, shadows, water, smudges, and pen marks. You can use preprocessing to rapidly reduce the quantity and increase the quality of the image data to be analyzed. This can lead to faster, more accurate model training.
In this series, we’ll look at whole-slide image processing and will describe various filters that can be used to increase the accuracy of tissue identification. After determining a useful set of filters for tissue segmentation, we’ll divide slides into tiles and determine sets of tiles that typically represent good tissue samples.
The solution should demonstrate high performance, flexibility, and accuracy. Filters should be easy to combine, chain, and modify. Tile scoring should be easy to modify for accurate tile selection. The solution should offer the ability to view filter, tile, and score results across large, unique datasets. The solution should also have the ability to work in a batch mode, where all of the image files and intermediary files are written to the file system, and in a dynamic mode, where high-scoring tissue tiles can be retrieved from the original WSI files without requiring any intermediary files.
In summary, we will scale down whole-slide images (WSI), apply filters to these scaled-down images for tissue segmentation, break the slides into tiles, score the tiles, and then retrieve the top tiles based on their scores.
This project heavily uses Python3. Python is an ideal language for image processing. OpenSlide is used for reading WSI files. Pillow is used for basic image manipulation in Python. NumPy is used for fast, concise, powerful processing of images as NumPy arrays. Scikit-image is heavily used for a wide variety of image functionality, such as morphology, thresholding, and edge detection.
The following are some quick set-up steps on a MacOS system.
Install a package manager like Homebrew.
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install python3
brew install openslide `
Next, we’ll install various useful Python packages using the pip3 package manager. These packages include:
pip3 install -U matplotlib numpy openslide-python Pillow scikit-image scikit-learn scipy
We use scikit-image filters (hysteresis thresholding) in this tutorial that are not present in the latest released version of scikit-image at the time of this writing. We can install scikit-image from the source, as described in the README file.
git clone https://github.com/scikit-image/scikit-image.git cd scikit-image pip3 install -r requirements.txt pip3 install .
Whole-slide imaging background
A whole-slide image is a digital representation of a microscopic slide, typically at a very high level of magnification such as 20x or 40x. As a result of this high magnification, whole slide images are typically very large in size. The maximum file size for a single whole-slide image in our training data set is 3.4 GB, with an average over 1 GB.
|WSI example slide|
A whole-slide image is created by a microscope that scans a slide and combines smaller images into a large image. Techniques include combining scanned square tiles into a whole-slide image and combining scanned strips into a resulting whole-slide image. Occasionally, the smaller constituent images can be visually discerned, as in the shaded area at the top of the following slide.
|Combining smaller images into a whole-slide image|
A fairly unusual feature of whole-slide images is the very large image size. For our training dataset of 500 images, the width varied from 19,920 pixels to 198,220 pixels, with an average of 101,688 pixels. The height varied from 13,347 pixels to 256,256 pixels, with an average of 73,154 pixels. The image total pixel sizes varied from 369,356,640 to 35,621,634,048 pixels, with an average of 7,670,709,628 pixels. The 500 training images take up a total of 525 GB of storage space.
|Training image sizes|
The following images shows a histogram distribution of the training image sizes in megapixels.
|Distribution of images based on number of pixels|
We can use the OpenSlide project to read a variety of whole-slide image formats, including the Aperio *.svs slide format of our training image set. This is a pyramidal, tiled format, where the massive slide is composed of a large number of constituent tiles.
To use the OpenSlide Python interface to view whole slide images, we can clone the OpenSlide Python interface from GitHub and use the included DeepZoom
git clone https://github.com/openslide/openslide-python.git cd openslide-python/examples/deepzoom python3 deepzoom_multiserver.py -Q 100 WSI_DIRECTORY
deepzoom_multiserver.py script starts a web interface on port 5000 and displays the image files at the specified file system location (the
WSI_DIRECTORY value in the previous code, which could be a location such as
~/git/python-wsi-preprocessing/data/). If image files exist in subdirectories, they will also be displayed in the list of available slides.
If this viewing application is installed on a server that also hosts the whole-slide image repository, this offers a convenient mechanism for you to view the slides without requiring local storage space.
|OpenSlide available slides|
The following image shows the initial view of one of the whole-slide images viewed in a web browser.
|OpenSlide whole-slide image|
Using this web interface, the whole-slide image can be zoomed to the highest magnification, revealing fine details at the tile level. Zooming and scrolling operations make it relatively easy to visually peruse the whole-slide image.
|OpenSlide whole-slide image zoomed|
Scale down images
To develop a set of filters that can be applied to an entire set of large whole-slide images, two of the first issues that we are confronted with are the size of the data and the format of the data. As mentioned, for our training dataset, the average .svs file size is over 1 GB and we have a total of 500 images. Additionally, the .svs format is a fairly unusual format that typically can’t be visually displayed by default by common applications and operating systems. Therefore, we will develop some code to overcome these important issues. Using OpenSlide and Python, we’ll convert the training data set to
smaller images in a common format, thus reformulating a big data problem as a small data problem. Before filtering at the entire slide level, we will shrink the width and height down by a factor of 32x, which means we can perform filtering on 1/1024 the image data. Converting 500
svs files to
png files at 1/32 scale takes approximately 12 minutes on a typical MacBook Pro using the following code.
In the wsi/slide.py file, we have many functions that can be used in relation to the original svs images. Of particular importance are the following functions:
open_slide() show_slide() slide_info(display_all_properties=True) slide_stats() training_slide_to_image() singleprocess_training_slides_to_images() multiprocess_training_slides_to_images()
open_slide() function uses OpenSlide to read in a .svs file. The
show_slide() function opens a WSI .svs file and displays a scaled-down version of the slide to the screen. The
slide_info() function displays metadata associated with all of the .svs files. The
slide_stats() function looks at all images and summarizes size information about the set of slides. It also generates a variety of charts for a visual representation of the slide statistics. The
training_slide_to_image() function converts a single .svs slide to a smaller image in a more common format such as .jpg or .png. The
singleprocess_training_slides_to_images() function converts all .svs slides to smaller images, and the
multiprocess_training_slides_to_images() function uses multiple processes (1 process per core) to
speed up the slide conversion process. For the last three functions, when an image is saved, a thumbnail image is also saved. By default, the thumbnail has a maximum height or width of 300 pixels and is a .jpg format.
One of the first actions you can take to become more familiar with the training data set is to look at the metadata associated with each image, which you can do with the
slide_info() function. The following code shows a sample of this metadata for Slide #1:
Opening Slide #1: /Volumes/BigData/TUPAC/training_slides/TUPAC-TR-001.svs Level count: 5 Level dimensions: ((130304, 247552), (32576, 61888), (8144, 15472), (2036, 3868), (1018, 1934)) Level downsamples: (1.0, 4.0, 16.0, 64.0, 128.0) Dimensions: (130304, 247552) Objective power: 40 Associated images: macro: <PIL.Image.Image image mode=RGBA size=497x1014 at 0x114B69F98> thumbnail: <PIL.Image.Image image mode=RGBA size=404x768 at 0x114B69FD0> Format: aperio Properties: Property: aperio.AppMag, value: 40 Property: aperio.MPP, value: 0.16437 Property: openslide.comment, value: Aperio Image Library v11.0.37 130304x247552 (256x256) JPEG/RGB Q=40;Mirax Digital Slide|AppMag = 40|MPP = 0.16437 Property: openslide.level-count, value: 5 Property: openslide.level.downsample, value: 1 Property: openslide.level.height, value: 247552 Property: openslide.level.tile-height, value: 256 Property: openslide.level.tile-width, value: 256 Property: openslide.level.width, value: 130304 Property: openslide.level.downsample, value: 4 Property: openslide.level.height, value: 61888 Property: openslide.level.tile-height, value: 256 Property: openslide.level.tile-width, value: 256 Property: openslide.level.width, value: 32576 Property: openslide.level.downsample, value: 16 Property: openslide.level.height, value: 15472 Property: openslide.level.tile-height, value: 256 Property: openslide.level.tile-width, value: 256 Property: openslide.level.width, value: 8144 Property: openslide.level.downsample, value: 64 Property: openslide.level.height, value: 3868 Property: openslide.level.tile-height, value: 256 Property: openslide.level.tile-width, value: 256 Property: openslide.level.width, value: 2036 Property: openslide.level.downsample, value: 128 Property: openslide.level.height, value: 1934 Property: openslide.level.tile-height, value: 256 Property: openslide.level.tile-width, value: 256 Property: openslide.level.width, value: 1018 Property: openslide.mpp-x, value: 0.16436999999999999 Property: openslide.mpp-y, value: 0.16436999999999999 Property: openslide.objective-power, value: 40 Property: openslide.quickhash-1, value: 0e0631ade42ae3384aaa727ce2e36a8272fe67039c513e17dccfdd592f6040cb Property: openslide.vendor, value: aperio Property: tiff.ImageDescription, value: Aperio Image Library v11.0.37 130304x247552 (256x256) JPEG/RGB Q=40;Mirax Digital Slide|AppMag = 40|MPP = 0.16437 Property: tiff.ResolutionUnit, value: inch
The most important metadata for our purposes is that the slide has a width of 130,304 pixels and a height of 247,552 pixels. Note that these values are displayed as width followed by height. For most of our image processing, we will be using NumPy arrays, where rows (height) are followed by columns (width).
If we visually look over the metadata associated with other images in the training dataset, we see that the slides are not consistent in their various properties such as the number of levels contained in the .svs files. The metadata implies that the data set comes from a variety of sources. The variability in the slides, especially regarding issues such as H&E staining and pen marks on the slides, needs to be considered during our filter development.
If we call the
slide_stats() function, in addition to the charts, we obtain a table of pixel statistics, as shown below.
Training images statistics
|Max width||198,220 pixels||10|
|Max height||256,256 pixels||387|
|Max size||35,621,634,048 pixels||387|
|Min width||19,920 pixels||112|
|Min height||13,347 pixels||108|
|Min size||369,356,640 pixels||112|
|Avg width||101,688 pixels|
|Avg height||73,154 pixels|
|Avg size||7,670,709,629 pixels|
wsi/slide.py file contains constants that can be used to control various image conversion settings. For example, the
SCALE_FACTOR constant controls the factor by which the slides will be scaled down. Its default value is 32, meaning that the height and width will be scaled down by a factor of 32. This means that when we perform filtering, it will be performed on an image 1/1024 the size of the original high-resolution image. The
DEST_TRAIN_EXT constant controls the output format. We will use the default format, .png.
Using macOS, the following conversion times using
multiprocess_training_slides_to_images() on the 500 image training set were obtained:
Training image dataset conversion times
multiprocess_training_slides_to_images() using the .png format, we have 500 scaled-down
whole-slide images in lossless .png format that we will examine in greater detail in relation to our filters.
Image saving, displaying, and conversions
To load, save, and display images, we use the Python Pillow package. In particular, we use the Image module, which contains an Image class used to represent an image. The
wsi/slide.py file contains an
open_image() function to open an image stored in the file system. The
get_training_image_path() function takes a slide number and returns the path to the corresponding training image file, meaning the scaled-down .png file that we created by calling
If we want to convert a single .svs WSI file to a scaled-down .png (without converting all .svs files), open that .png image file as a PIL Image, and display the image to the screen, we can do the following.
slide.training_slide_to_image(4) img_path = slide.get_training_image_path(4) img = slide.open_image(img_path) img.show()
To mathematically manipulate the images, we use NumPy arrays. The wsi/util.py file contains a
pil_to_np_rgb() function that converts a PIL Image to a 3-dimensional NumPy array in RGB format. The first dimension represents the number of rows, the second dimension represents the number of columns, and the third dimension represents the channel (red, green, and blue).
rgb = util.pil_to_np_rgb(img)
For convenience, the
display_img() function can be used to display a NumPy array image. Text can be added to the displayed image, which can be very useful when visually comparing the results of multiple filters.
img_path = slide.get_training_image_path(2) img = slide.open_image(img_path) rgb = util.pil_to_np_rgb(img) util.display_img(rgb, "RGB")
|Display image with text|
When performing operations on NumPy arrays, functions in the wsi/filter.py file will often utilize the
util.np_info() function to display information about the NumPy array and the amount of time required to perform the operation. For example, the previouis call to
pil_to_np_rgb() internally calls
t = Time() rgb = np.asarray(pil_img) np_info(rgb, "RGB", t.elapsed()) return rgb
This call to
np_info() generates console output like the following:
RGB | Time: 0:00:00.162484 Type: uint8 Shape: (1385, 1810, 3)
We see that the PIL-to-NumPy array conversion took 0.16s. The type of the NumPy array is
uint8, which means that each pixel is represented by a red, green, and blue unsigned integer value from 0 to 255. The image has a height of 1385 pixels, a width of 1810 pixels, and three channels (representing red, green, and blue).
We can obtain additional information about NumPy arrays by setting the
util.ADDITIONAL_NP_STATS constant to
True. If we rerun the previous code with
ADDITIONAL_NP_STATS = True, we see the following:
RGB | Time: 0:00:00.157696 Min: 2.00 Max: 255.00 Mean: 182.62 Binary: F Type: uint8 Shape: (1385, 1810, 3)
The minimum value is 2, the maximum value is 255, the mean value is 182.62, and binary is false, meaning that the image is not a binary image. A binary image is an image that consists of only two values (True or False, 1.0 or 0.0, 255 or 0). Binary images are produced by actions such as thresholding.
When interacting with NumPy image processing code, the information provided by
np_info() can be extremely useful. For example, some functions return boolean NumPy arrays, other functions return float NumPy arrays, and other functions might return
uint8 NumPy arrays. Before performing actions on a NumPy array, it’s usually necessary to know the data type of the array and the nature of the data in that array. For performance reasons, normally
ADDITIONAL_NP_STATS should be set to
The wsi/util.py file contains an
np_to_pil() function that converts a NumPy array to a PIL image.
If we have a PIL image, saving the image to the file system can be accomplished by calling the image’s
This first article in the automatic identification of tissues from big whole-slide images series gave you an overview of what our challenge was and explained how to set up the developing environment and scale down the images. Part 2 of this series will explain applying filters and tissue segmentation.