In Part 1 in this series, we gave an overview of this project and explained how we scaled down the images. Part 2 showed how we investigated image filters and determined a set of filters that can be used for effective tissue segmentation with our data set. Part 3 gave an explanation of morphology operators and showed how we combined filters and applied filters to multiple images. This final article in the series gives an explanation of tiling and top tile retrieval.

Tiles

Following our filtering, we should have fairly good tissue segmentation for our data set, where non-tissue pixels have been masked out from our 1/32x scaled-down slide images. At this stage, we break our images into tile regions. Tiling code is located in the wsi/tiles.py file.
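The tissue percentage of a tile can be thought of as the share of its pixels that survived masking. The following is a minimal sketch using plain Python lists rather than the NumPy arrays the project works with. The function name tissue_percent_sketch and the convention that a masked-out pixel is pure black, (0, 0, 0), are assumptions made for illustration.

```python
def tissue_percent_sketch(pixels):
    """Return the percentage of pixels that are not masked out.

    `pixels` is a flat list of (r, g, b) tuples. A pixel equal to
    (0, 0, 0) is treated as masked out (an assumption of this sketch).
    """
    if not pixels:
        return 0.0
    tissue = sum(1 for p in pixels if p != (0, 0, 0))
    return 100.0 * tissue / len(pixels)


# A 2x2 "tile" in which one of the four pixels has been masked out.
tile = [(200, 120, 180), (0, 0, 0), (190, 110, 170), (210, 130, 190)]
print(tissue_percent_sketch(tile))  # 75.0
```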

For visualization, the tissue percentage of each tile is color-coded in a similar fashion to a heat map. Tiles with 80% or more tissue are green, tiles with less than 80% but at least 10% tissue are yellow, tiles with less than 10% but more than 0% tissue are orange, and tiles with 0% tissue are red.

The heat map threshold values can be adjusted by modifying the TISSUE_HIGH_THRESH and TISSUE_LOW_THRESH constants in the wsi/tiles.py file, which have default values of 80 and 10, respectively. Heat map colors can be adjusted by modifying the HIGH_COLOR, MEDIUM_COLOR, LOW_COLOR, and NONE_COLOR constants. The heat map border size can be adjusted using the TILE_BORDER_SIZE constant, which has a default value of 2. Tile sizes are specified in pixels of the original WSI files. The default ROW_TILE_SIZE and COL_TILE_SIZE values are 1,024 pixels.
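The color-coding rule can be sketched as a small function. The threshold constants and their defaults come from the text; the function name heat_map_color and the use of color names instead of RGB tuples are assumptions for illustration.

```python
TISSUE_HIGH_THRESH = 80  # default stated in the text
TISSUE_LOW_THRESH = 10   # default stated in the text


def heat_map_color(tissue_percent):
    """Map a tissue percentage to the heat map color described above."""
    if tissue_percent >= TISSUE_HIGH_THRESH:
        return "green"
    if tissue_percent >= TISSUE_LOW_THRESH:
        return "yellow"
    if tissue_percent > 0:
        return "orange"
    return "red"


for pct in (95.0, 80.0, 42.5, 10.0, 3.2, 0.0):
    print(pct, heat_map_color(pct))
```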

To generate and display tiles for a single slide, we use the summary_and_tiles() function, which generates tile summaries and returns the top scoring tiles for a slide. We discuss tile scoring in a later section.

Let’s generate tile tissue heat map summaries for slide #2 and display the summaries to the screen.

tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=False, save_top_tiles=False)

In the following images, we see the tile tissue segmentation heat map summaries that are generated. The heat maps are displayed on the masked image and the original image to allow for comparison.

Tissue Heat Map Tissue Heat Map on Original

We see a variety of slide statistics displayed on the tile summaries. Slide #2 has dimensions of 57,922×44,329. After scaling down the slide width and height by 1/32x, we have a .png image with dimensions 1,810×1,385. Breaking this image down into 32×32 tiles, we have 44 rows and 57 columns, making a total of 2,508 tiles. Using our tissue segmentation filtering algorithms, we have 1,283 tiles with high tissue percentages (>=80%), 397 tiles with medium tissue percentages (>=10% and <80%), 102 tiles with low tissue percentages (>0% and <10%), and 726 tiles with no tissue (0%).

Characteristic Result
Original Dimensions 57,922×44,329
Original Tile Size 1,024×1,024
Scale Factor 1/32x
Scaled Dimensions 1,810×1,385
Scaled Tile Size 32×32
Total Mask 41.60%
Total Tissue 58.40%
Tiles 57×44 = 2,508
1,283 (51.16%) tiles >=80% tissue
397 (15.83%) tiles >=10% and <80% tissue
102 ( 4.07%) tiles >0% and <10% tissue
726 (28.95%) tiles =0% tissue
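The figures in this table can be checked with a few lines of arithmetic. In this sketch, the use of floor division for scaling and of a ceiling for the tile counts are assumptions chosen because they reproduce the reported numbers; the variable names are illustrative.

```python
import math

SCALE_FACTOR = 32
ROW_TILE_SIZE = 1024
COL_TILE_SIZE = 1024

orig_w, orig_h = 57922, 44329

# Scaled-down image dimensions (floor division is an assumption).
scaled_w = orig_w // SCALE_FACTOR
scaled_h = orig_h // SCALE_FACTOR

# Tile size on the scaled-down image.
scaled_tile_w = COL_TILE_SIZE // SCALE_FACTOR
scaled_tile_h = ROW_TILE_SIZE // SCALE_FACTOR

# Partial tiles at the right and bottom edges still count as tiles.
tiles_across = math.ceil(scaled_w / scaled_tile_w)
tiles_down = math.ceil(scaled_h / scaled_tile_h)
total_tiles = tiles_across * tiles_down

counts = {"high": 1283, "medium": 397, "low": 102, "none": 726}
percentages = {k: round(100 * v / total_tiles, 2) for k, v in counts.items()}

print(scaled_w, scaled_h)                     # 1810 1385
print(tiles_across, tiles_down, total_tiles)  # 57 44 2508
print(percentages)
```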

Often, it can be useful to know the exact row and column of a particular tile or tiles. If the DISPLAY_TILE_SUMMARY_LABELS constant is set to True, the row and column of each tile are output on the tile summaries. Generating the tile labels is fairly time-consuming, so DISPLAY_TILE_SUMMARY_LABELS should usually be set to False for performance.

Optional Tile Labels

Tile scoring

To quantify how “good” a tile is compared to other tiles, we assign scores to tiles based on tissue percentage and color characteristics. To determine the “best” tiles, we sort based on score and return the top scoring tiles. We generate top tile summaries based on the top scoring tiles, in a similar fashion to the tissue percentage summaries.
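Sorting by score and keeping the best tiles can be sketched as follows. The function name top_tiles_sketch and the (tile_id, score) tuple representation are assumptions for illustration; NUM_TOP_TILES and its default of 50 are taken from this article.

```python
NUM_TOP_TILES = 50  # default number of top tiles stated in this article


def top_tiles_sketch(scored_tiles, n=NUM_TOP_TILES):
    """Return the n highest-scoring (tile_id, score) pairs, best first."""
    return sorted(scored_tiles, key=lambda t: t[1], reverse=True)[:n]


scored = [("a", 0.31), ("b", 0.88), ("c", 0.05), ("d", 0.74)]
print(top_tiles_sketch(scored, n=2))  # [('b', 0.88), ('d', 0.74)]
```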

The score_tile() function assigns a score to a tile based on the tissue percentage and various color characteristics of the tile. The scoring formula utilized by score_tile() can be summarized as follows.

Scoring Formula

The scoring formula generates good results for the images in the data set and was developed through experimentation with the training data set. The tissuepercent is emphasized by squaring its value. The colorfactor value is used to weight hematoxylin staining more heavily than eosin staining. Utilizing the HSV color model, broad saturation and value distributions are given more weight by the saturationvaluefactor. The quantityfactor value utilizes the tissue percentage to give more weight to tiles with more tissue. Note that if colorfactor, saturationvaluefactor, or quantityfactor evaluates to 0, the score will be 0. The score is scaled to a value from 0.0 to 1.0.
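One way to write a score with the properties just described is sketched below. This is not necessarily the exact formula used by score_tile(): the logarithm, the divisor of 1000, and the constant 10 in the scaling step are assumptions of this sketch, chosen so that the tissue percentage is squared, a zero factor yields a zero score, and the result stays between 0.0 and 1.0.

```python
import math


def score_sketch(tissue_percent, color_factor, s_and_v_factor, quantity_factor):
    """Illustrative score: squared tissue percentage times a damped
    product of the three factors, squashed into the range [0, 1)."""
    combined = color_factor * s_and_v_factor * quantity_factor
    # log(1 + 0) == 0, so any zero factor gives a raw score of 0.
    raw = (tissue_percent ** 2) * math.log(1 + combined) / 1000.0
    # Scale to a value between 0 and 1 (assumed scaling).
    return 1.0 - 10.0 / (10.0 + raw)


print(score_sketch(100.0, 0.0, 1.0, 1.0))  # 0.0
print(score_sketch(100.0, 500.0, 1.0, 1.0) > score_sketch(50.0, 500.0, 1.0, 1.0))  # True
```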

During our discussion of color staining, we mentioned that tissue with hematoxylin staining is most likely preferable to eosin staining. Hematoxylin stains acidic structures such as DNA and RNA with a purple tone, while eosin stains basic structures such as cytoplasm proteins with a pink tone. Let’s discuss how we can more heavily score tiles with hematoxylin staining over eosin staining.

Differentiating purplish shades from pinkish shades can be difficult using the RGB color space (see https://en.wikipedia.org/wiki/RGB_color_space). Therefore, to compute our colorfactor value, we first convert our tile in RGB color space to HSV color space (see https://en.wikipedia.org/wiki/HSL_and_HSV). HSV stands for Hue-Saturation-Value. In this color model, the hue is represented as a degree value on a circle. Purple has a hue of 270 degrees and pink has a hue of 330 degrees. We remove all hues less than 260 and greater than 340. Next, we compute the deviation from purple (270) and the deviation from pink (330). We compute an average factor that is the squared difference of 340 and the hue average. The colorfactor is computed as the pink deviation times the average factor divided by the purple deviation.
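The steps in this paragraph can be sketched with the standard library alone. The function names are illustrative, and two details are assumptions of this sketch: "deviation" is taken to mean root-mean-square deviation, and a factor of 0 is returned when no hues remain after filtering or when the purple deviation is 0.

```python
import colorsys
import math

HSV_PURPLE = 270  # hue of purple, in degrees
HSV_PINK = 330    # hue of pink, in degrees


def hue_degrees(r, g, b):
    """Convert an 8-bit RGB pixel to its HSV hue in degrees (0-360)."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0


def rms_deviation(hues, center):
    """Root-mean-square deviation of hues from a center hue."""
    return math.sqrt(sum((h - center) ** 2 for h in hues) / len(hues))


def color_factor_sketch(hues):
    """Favor hues near purple (270) over hues near pink (330)."""
    kept = [h for h in hues if 260 <= h <= 340]  # drop hues outside 260-340
    if not kept:
        return 0.0
    purple_dev = rms_deviation(kept, HSV_PURPLE)
    pink_dev = rms_deviation(kept, HSV_PINK)
    if purple_dev == 0:
        return 0.0  # assumption: avoid dividing by zero
    avg_factor = (340 - sum(kept) / len(kept)) ** 2
    return pink_dev * avg_factor / purple_dev


purplish = [265.0, 275.0, 280.0]
pinkish = [320.0, 330.0, 335.0]
print(color_factor_sketch(purplish) > color_factor_sketch(pinkish))  # True
```

Hues close to purple give a small purple deviation, a large pink deviation, and a large average factor, so all three terms push the factor up for hematoxylin-like tiles.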

Let’s have a closer look at a 32×32 tile and its accompanying HSV hue histogram. Note that to properly convert a matplotlib chart image (the histogram) to a NumPy image on macOS, we currently need to include a call to matplotlib.use('Agg'). One way that we can obtain a particular tile for analysis is to call the dynamic_tile() function, which we describe in more detail later. Here, we obtain the tile at the 29th row and 16th column on slide #2. Setting the small_tile_in_tile parameter to True means that the scaled-down 32×32 tile is included in the returned Tile object. The display_image_with_hsv_hue_histogram() function is used to display the small tile and its hue histogram.

# To get around renderer issue on macOS going from Matplotlib image to NumPy image.
import matplotlib
matplotlib.use('Agg')
from deephistopath.wsi import tiles

tile = tiles.dynamic_tile(2, 29, 16, True)
tiles.display_image_with_hsv_hue_histogram(tile.get_np_scaled_tile(), scale_up=True)

Here, we see the 32×32 tile with its accompanying hue histogram. For convenience, colors have been added to the histogram. Also, notice that the non-tissue masked-out pixels have a peak at 0 degrees.

Tile HSV Hue Histogram

For convenience, the Tile class has a display_with_histograms() function that can be used to display histograms for both the RGB and HSV color spaces. If the scaled-down small tile is included in the Tile object (using the dynamic_tile() small_tile_in_tile parameter with a value of True), histograms will be displayed for both the small tile and the large tile.

import matplotlib
matplotlib.use('Agg')
from deephistopath.wsi import tiles

tile = tiles.dynamic_tile(2, 29, 16, True)
tile.display_with_histograms()

The following image shows RGB and HSV histograms for the scaled-down tile at slide 2, row 29, column 16. We see its score and tissue percentage. This tile’s score was ranked 734 out of a total of 2,508 tiles on this slide.

Small Tile Color Histograms

The following image shows RGB and HSV histograms for the full-sized 1,024×1,024 tile at slide 2, row 29, column 16. Notice that the small tile pixels offer a reasonable approximation of the colors present on the large tile. Also, notice that the masked-out pixels in the small tile correspond fairly accurately with the non-tissue regions of the large tile.

Large Tile Color Histograms

If the save_data parameter of the summary_and_tiles() function is set to True, detailed data about the slide tiles are saved in a .csv format.

tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=True, save_top_tiles=False)

For slide #2, this generates a TUPAC-TR-002-32x-57922x44329-1810x1385-tile_data.csv file.

Tile Data

In addition to the tile tissue heat map summaries, the summary_and_tiles() function generates top tile summaries. By default, it highlights the top 50 scoring tiles. The number of top tiles can be controlled by the NUM_TOP_TILES constant.

tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=False, save_top_tiles=False)

The following image shows the top tile summary on the masked image for slide #2. Notice that tiles with high tissue percentages and hematoxylin-stained tissue are favored over tiles with low tissue percentages and eosin-stained tissue. Statistics about the top 50 scoring tiles are displayed to the right of the image.

Top Tiles

For visual inspection, the top tile summary is also generated over the original slide image, as we see here.

Top Tiles on Original

When analyzing top tile results, it can be useful to see the tissue percentage heat map of surrounding tiles. This can be accomplished by setting the BORDER_ALL_TILES_IN_TOP_TILE_SUMMARY constant to True. Likewise, it can be useful to see the row and column coordinates of all tiles, which can be accomplished by setting the LABEL_ALL_TILES_IN_TOP_TILE_SUMMARY constant to True.

Top Tile Borders Top Tile Labels

The following image shows a section of a top tile summary that features both the tile tissue heat map and the row and column labels.

Top Tile Labels and Borders

Top tile retrieval

Top tiles can be saved as files in batch mode or retrieved dynamically. In batch mode, tiling, scoring, and saving the 1,000 tissue percentage heat map summaries (2 per image), the 1,000 top tile summaries (2 per image), the 2,000 thumbnails, and 25,000 1Kx1K tiles (50 per image) takes approximately 2 hours.

If the save_top_tiles parameter of the summary_and_tiles() function is set to True, the top-ranking tiles for the specified slide will be saved to the file system.

tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=False, save_top_tiles=True)

In general, it is recommended that you use the singleprocess_filtered_images_to_tiles() or multiprocess_filtered_images_to_tiles() function in the wsi/tiles.py file. These functions generate convenient HTML pages for investigating the tiles generated for a slide set. The multiprocess_filtered_images_to_tiles() function uses multiprocessing for added performance. If no image_num_list parameter is provided, all images in the data set are processed.

In the following code, we generate the top 50 tiles for slides #1, #2, and #3.

tiles.multiprocess_filtered_images_to_tiles(image_num_list=[1, 2, 3])

On the generated tiles.html page, we see the original slide images, the images after filtering, the tissue percentage heat map summaries on the filtered images and the original images, tile summary data including links to the generated .csv file for each slide, the top tile summaries on the filtered images and the original images, and links to the top 50 tile files for each slide.

Tiles Page

The full-size 1,024×1,024 tiles can be investigated using the top tile links. In the following images, we see the two top-scoring tiles on slide 2 at row 34, column 34 and row 35, column 37.

Slide #2, Top Tile #1 Slide #2, Top Tile #2

Tiles can also be retrieved dynamically. In dynamic tile retrieval, slides are scaled down, filtered, tiled, and scored all in-memory. The top tiles can then be retrieved from the original WSI file and stored in-memory. No intermediate files are written to the file system during dynamic tile retrieval.
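Retrieving a tile from the original WSI file requires knowing which pixel region the tile covers. The sketch below is a simplified mapping and not the project's own code: it assumes 1-based row and column numbers (consistent with the tile listings in this article) and maps them directly onto 1,024-pixel steps, clipping at the slide edges. The library may instead map through the scaled-down coordinates, which can differ by a few pixels. The function name tile_region_sketch is hypothetical.

```python
ROW_TILE_SIZE = 1024
COL_TILE_SIZE = 1024


def tile_region_sketch(row, col, slide_width, slide_height):
    """Return (x, y, width, height) of a 1-based (row, col) tile in
    original-slide pixels, clipped to the slide bounds."""
    x = (col - 1) * COL_TILE_SIZE
    y = (row - 1) * ROW_TILE_SIZE
    width = min(COL_TILE_SIZE, slide_width - x)
    height = min(ROW_TILE_SIZE, slide_height - y)
    return x, y, width, height


# Slide #2 is 57,922 x 44,329 pixels.
print(tile_region_sketch(34, 34, 57922, 44329))  # (33792, 33792, 1024, 1024)
print(tile_region_sketch(44, 57, 57922, 44329))  # (57344, 44032, 578, 297)
```

A region in this form matches what a reader such as OpenSlide's read_region((x, y), 0, (width, height)) takes as its location, level, and size arguments.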

In the following code, we dynamically obtain a TileSummary object by calling dynamic_tiles() for slide #2. We obtain the top-scoring tiles from tile_summary, outputting status information about each tile. The status information includes the tile number, the row number, the column number, the tissue percentage, and the tile score.

tile_summary = tiles.dynamic_tiles(2)
top_tiles = tile_summary.top_tiles()
for t in top_tiles:
  print(t)

In the console output, we see that the original .svs file is opened, the slide is scaled down, and our series of filters is run on the scaled-down image. After that, the tiles are scored, and we see status information about the top 50 tiles for the slide.

Opening Slide #2: ../data/training_slides/TUPAC-TR-002.svs
RGB                  | Time: 0:00:00.007339  Type: uint8   Shape: (1385, 1810, 3)
Filter Green Channel | Time: 0:00:00.005135  Type: bool    Shape: (1385, 1810)
Mask RGB             | Time: 0:00:00.007973  Type: uint8   Shape: (1385, 1810, 3)
Filter Grays         | Time: 0:00:00.073780  Type: bool    Shape: (1385, 1810)
Mask RGB             | Time: 0:00:00.008114  Type: uint8   Shape: (1385, 1810, 3)
Filter Red Pen       | Time: 0:00:00.066007  Type: bool    Shape: (1385, 1810)
Mask RGB             | Time: 0:00:00.007925  Type: uint8   Shape: (1385, 1810, 3)
Filter Green Pen     | Time: 0:00:00.105854  Type: bool    Shape: (1385, 1810)
Mask RGB             | Time: 0:00:00.008034  Type: uint8   Shape: (1385, 1810, 3)
Filter Blue Pen      | Time: 0:00:00.087092  Type: bool    Shape: (1385, 1810)
Mask RGB             | Time: 0:00:00.007963  Type: uint8   Shape: (1385, 1810, 3)
Mask RGB             | Time: 0:00:00.007807  Type: uint8   Shape: (1385, 1810, 3)
Remove Small Objs    | Time: 0:00:00.034308  Type: bool    Shape: (1385, 1810)
Mask RGB             | Time: 0:00:00.007814  Type: uint8   Shape: (1385, 1810, 3)
[Tile #1915, Row #34, Column #34, Tissue 100.00%, Score 0.8824]
[Tile #1975, Row #35, Column #37, Tissue 100.00%, Score 0.8816]
[Tile #1974, Row #35, Column #36, Tissue 99.90%, Score 0.8811]
[Tile #500, Row #9, Column #44, Tissue 99.32%, Score 0.8797]
[Tile #814, Row #15, Column #16, Tissue 99.22%, Score 0.8795]
[Tile #1916, Row #34, Column #35, Tissue 100.00%, Score 0.8789]
[Tile #1956, Row #35, Column #18, Tissue 99.51%, Score 0.8784]
[Tile #1667, Row #30, Column #14, Tissue 98.63%, Score 0.8783]
[Tile #1839, Row #33, Column #15, Tissue 99.51%, Score 0.8782]
[Tile #1725, Row #31, Column #15, Tissue 99.61%, Score 0.8781]
[Tile #2061, Row #37, Column #9, Tissue 98.54%, Score 0.8779]
[Tile #724, Row #13, Column #40, Tissue 99.90%, Score 0.8778]
[Tile #1840, Row #33, Column #16, Tissue 99.22%, Score 0.8777]
[Tile #758, Row #14, Column #17, Tissue 99.41%, Score 0.8775]
[Tile #1722, Row #31, Column #12, Tissue 98.24%, Score 0.8771]
[Tile #722, Row #13, Column #38, Tissue 99.51%, Score 0.8769]
[Tile #1803, Row #32, Column #36, Tissue 99.22%, Score 0.8769]
[Tile #446, Row #8, Column #47, Tissue 100.00%, Score 0.8768]
[Tile #988, Row #18, Column #19, Tissue 99.61%, Score 0.8767]
[Tile #2135, Row #38, Column #26, Tissue 99.80%, Score 0.8767]
[Tile #704, Row #13, Column #20, Tissue 99.61%, Score 0.8767]
[Tile #816, Row #15, Column #18, Tissue 99.41%, Score 0.8766]
[Tile #1180, Row #21, Column #40, Tissue 99.90%, Score 0.8765]
[Tile #1178, Row #21, Column #38, Tissue 99.80%, Score 0.8765]
[Tile #1042, Row #19, Column #16, Tissue 99.71%, Score 0.8764]
[Tile #1783, Row #32, Column #16, Tissue 99.80%, Score 0.8764]
[Tile #1978, Row #35, Column #40, Tissue 100.00%, Score 0.8763]
[Tile #832, Row #15, Column #34, Tissue 99.61%, Score 0.8762]
[Tile #1901, Row #34, Column #20, Tissue 99.90%, Score 0.8759]
[Tile #701, Row #13, Column #17, Tissue 99.80%, Score 0.8758]
[Tile #817, Row #15, Column #19, Tissue 99.32%, Score 0.8757]
[Tile #2023, Row #36, Column #28, Tissue 100.00%, Score 0.8754]
[Tile #775, Row #14, Column #34, Tissue 99.51%, Score 0.8754]
[Tile #1592, Row #28, Column #53, Tissue 100.00%, Score 0.8753]
[Tile #702, Row #13, Column #18, Tissue 99.22%, Score 0.8753]
[Tile #759, Row #14, Column #18, Tissue 99.51%, Score 0.8752]
[Tile #1117, Row #20, Column #34, Tissue 99.90%, Score 0.8751]
[Tile #1907, Row #34, Column #26, Tissue 99.32%, Score 0.8750]
[Tile #1781, Row #32, Column #14, Tissue 99.61%, Score 0.8749]
[Tile #2250, Row #40, Column #27, Tissue 99.61%, Score 0.8749]
[Tile #1902, Row #34, Column #21, Tissue 99.90%, Score 0.8749]
[Tile #2014, Row #36, Column #19, Tissue 99.22%, Score 0.8749]
[Tile #2013, Row #36, Column #18, Tissue 99.51%, Score 0.8747]
[Tile #1175, Row #21, Column #35, Tissue 99.71%, Score 0.8746]
[Tile #760, Row #14, Column #19, Tissue 99.22%, Score 0.8746]
[Tile #779, Row #14, Column #38, Tissue 99.32%, Score 0.8745]
[Tile #1863, Row #33, Column #39, Tissue 99.71%, Score 0.8745]
[Tile #1899, Row #34, Column #18, Tissue 99.51%, Score 0.8745]
[Tile #778, Row #14, Column #37, Tissue 99.90%, Score 0.8743]
[Tile #1724, Row #31, Column #14, Tissue 99.51%, Score 0.8741]

If we’d like to obtain each tile as a NumPy array, we can do so by calling the get_np_tile() function on the Tile object.

tile_summary = tiles.dynamic_tiles(2)
top_tiles = tile_summary.top_tiles()
for t in top_tiles:
  print(t)
  np_tile = t.get_np_tile()

As a further example, in the following code, we dynamically retrieve the tiles for slide #4 and display the top 2 tiles along with their RGB and HSV histograms.

tile_summary = tiles.dynamic_tiles(4)
top = tile_summary.top_tiles()[:2]
for t in top:
  t.display_with_histograms()
Slide #4, Top Tile #1 Slide #4, Top Tile #2

Next, we dynamically retrieve the tiles for slide #2. We display (not shown) the tile tissue heat map and top tile summaries and then obtain the tiles ordered by tissue percentage. We display the 1,000th and 1,500th tiles by tissue percentage.

tile_summary = tiles.dynamic_tiles(2)
tile_summary.display_summaries()
ts = tile_summary.tiles_by_tissue_percentage()
ts[999].display_with_histograms()
ts[1499].display_with_histograms()

In the following images, we see the 1,000th and 1,500th tiles ordered by tissue percentage for slide #2. Note that the displayed tile rank information is based on score rather than tissue percentage alone.

Slide #2, Tissue Percentage #1000 Slide #2, Tissue Percentage #1500

Tiles can be retrieved based on position. In the following code, we display the tiles at row 25, column 30 and row 25, column 31 on slide #2.

tile_summary = tiles.dynamic_tiles(2)
tile_summary.get_tile(25, 30).display_tile()
tile_summary.get_tile(25, 31).display_tile()
Slide #2, Row #25, Column #30 Slide #2, Row #25, Column #31

If an individual tile is required, the dynamic_tile() function can be used.

tiles.dynamic_tile(2, 25, 32).display_tile()
Slide #2, Row #25, Column #32

If multiple tiles need to be retrieved dynamically, for performance reasons dynamic_tiles() is preferable to dynamic_tile().

Summary

In this article series, we’ve taken a look at how Python, in particular with packages such as NumPy and scikit-image, can be used for tissue segmentation in whole-slide images. To efficiently process images in our data set, we utilized OpenSlide to scale down the slides. Using NumPy arrays, we investigated a wide variety of image filters and settled on a combination and series of filters that demonstrated fast, acceptably accurate tissue segmentation for our data set. Following this, we divided the filtered images into tiles and scored the tiles based on tissue percentage and color characteristics such as the degree of hematoxylin staining versus eosin staining. We then demonstrated how we can retrieve the top-scoring tiles that have high tissue percentages and preferred staining characteristics. We saw how whole-slide images could be processed in batches or dynamically. Scaling, filtering, tiling, scoring, and saving the top tiles can be accomplished in batch mode using multiprocessing in the following manner.

slide.multiprocess_training_slides_to_images()
filter.multiprocess_apply_filters_to_images()
tiles.multiprocess_filtered_images_to_tiles()

The previous code generates HTML filter and tile pages that simplify visual inspection of the image processing and the final tile results. Because the average number of pixels per whole-slide image is 7,670,709,629 and we have reduced the data to the top 50 1,024×1,024 pixel tiles, we have reduced the raw image data down by a factor of 146x while identifying tiles that have significant potential for further useful analysis.
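The reduction factor follows directly from the numbers quoted above, as this short check shows. The variable names are illustrative.

```python
avg_pixels_per_slide = 7_670_709_629  # average stated in the text
num_top_tiles = 50
tile_pixels = 1024 * 1024

# Pixels kept per slide when only the top 50 tiles are retained.
kept_pixels = num_top_tiles * tile_pixels
reduction_factor = avg_pixels_per_slide / kept_pixels

print(kept_pixels)                 # 52428800
print(f"{reduction_factor:.1f}x")  # 146.3x
```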