Top tile retrieval
Learn how one team developed algorithms to automatically identify tissues from big whole-slide images
In Part 1 in this series, we gave an overview of this project and explained how we scaled down the images. Part 2 showed how we investigated image filters and determined a set of filters that can be used for effective tissue segmentation with our data set. Part 3 gave an explanation of morphology operators and showed how we combined filters and applied filters to multiple images. This final article in the series gives an explanation of tiling and top tile retrieval.
Following our filtering, we should have fairly good tissue segmentation for our data set, where non-tissue pixels have been masked out from our 1/32x scaled-down slide images. At this stage, we break our images into tile regions. Tiling code is located in the wsi/tiles.py file.
For visualization, the tissue percentage of each tile is color-coded in a similar fashion to a heat map. Tiles with 80% or more tissue are green, tiles with less than 80% but at least 10% tissue are yellow, tiles with less than 10% but more than 0% tissue are orange, and tiles with 0% tissue are red.
The heat map threshold values can be adjusted by modifying the tissue threshold constants (such as TISSUE_LOW_THRESH) in the wsi/tiles.py file; the high and low thresholds have default values of 80 and 10, respectively. Heat map colors can be adjusted by modifying the color constants (such as NONE_COLOR). The heat map border size can be adjusted using the TILE_BORDER_SIZE constant, which has a default value of 2. Tile sizes are specified according to the number of pixels in the original WSI files. The default tile size values (such as COL_TILE_SIZE) are 1,024 pixels.
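As a rough illustration of the color coding described above, the following sketch maps a tissue percentage to a heat map color name. The function name and the HIGH_THRESH and LOW_THRESH variables are hypothetical stand-ins for the constants in wsi/tiles.py, set to the default values of 80 and 10.

```python
# Hypothetical stand-ins for the threshold constants in wsi/tiles.py (defaults 80 and 10).
HIGH_THRESH = 80
LOW_THRESH = 10


def heat_map_color(tissue_percentage):
    """Return the heat map color name for a tile's tissue percentage (0 to 100)."""
    if tissue_percentage >= HIGH_THRESH:
        return "green"   # high tissue
    if tissue_percentage >= LOW_THRESH:
        return "yellow"  # medium tissue
    if tissue_percentage > 0:
        return "orange"  # low tissue
    return "red"         # no tissue


for pct in (100.0, 80.0, 45.5, 10.0, 3.2, 0.0):
    print(f"{pct:6.2f}% -> {heat_map_color(pct)}")
```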
To generate and display tiles for a single slide, we use the
summary_and_tiles() function, which generates tile summaries and returns the top scoring tiles for a slide. We discuss tile scoring in a later section.
Let’s generate tile tissue heat map summaries for slide #2 and display the summaries to the screen.
tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=False, save_top_tiles=False)
In the following images, we see the tile tissue segmentation heat map summaries that are generated. The heat maps are displayed on the masked image and the original image to allow for comparison.
|Tissue heat map||Tissue heat map on original|
We see a variety of slide statistics displayed on the tile summaries. We see that slide #2 has dimensions of 57,922×44,329. After scaling the slide width and height down to 1/32x, we have a .png image with dimensions 1,810×1,385. Breaking this image down into 32×32 tiles, we have 44 rows and 57 columns, making a total of 2,508 tiles. Using our tissue segmentation filtering algorithms, we have 1,283 tiles with high tissue percentages (>=80%), 397 tiles with medium tissue percentages (>=10% and <80%), 102 tiles with low tissue percentages (>0% and <10%), and 726 tiles with no tissue (0%).
|Original Tile Size||1,024×1,024|
|Scaled Tile Size||32×32|
|Tiles||57×44 = 2,508|
|1,283 (51.16%) tiles >=80% tissue|
|397 (15.83%) tiles >=10% and <80% tissue|
|102 ( 4.07%) tiles >0% and <10% tissue|
|726 (28.95%) tiles =0% tissue|
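The tile counts above can be reproduced with a little arithmetic. This sketch assumes that a partial tile at the right or bottom edge still counts as a tile (hence the rounding up); the variable names are illustrative.

```python
import math

SCALE_FACTOR = 32          # slides are scaled down to 1/32x
ORIGINAL_TILE_SIZE = 1024  # tile size in pixels of the original slide

# Scaled-down dimensions of slide #2 (width x height), from the summary above.
scaled_width, scaled_height = 1810, 1385

scaled_tile_size = ORIGINAL_TILE_SIZE // SCALE_FACTOR   # 32
num_cols = math.ceil(scaled_width / scaled_tile_size)   # 57
num_rows = math.ceil(scaled_height / scaled_tile_size)  # 44
total_tiles = num_rows * num_cols                       # 2508

# Tile counts per tissue category, from the summary above.
counts = {"high": 1283, "medium": 397, "low": 102, "none": 726}

print(f"{num_cols} columns x {num_rows} rows = {total_tiles:,} tiles")
for name, count in counts.items():
    print(f"{name:>6}: {count:5,} ({100 * count / total_tiles:.2f}%)")
```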
Often, it can be useful to know the exact row and column of a particular tile or tiles. If the
DISPLAY_TILE_SUMMARY_LABELS constant is set to True, the row and column of each tile is
output on the tile summaries. Generating the tile labels is fairly time-consuming, so usually
DISPLAY_TILE_SUMMARY_LABELS should be set to False for performance.
|Optional tile labels|
To quantify how “good” a tile is compared to other tiles, we assign scores to tiles based on tissue percentage and color characteristics. To determine the “best” tiles, we sort based on score and return the top scoring tiles. We generate top tile summaries based on the top scoring tiles, in a similar fashion as the tissue percentage summaries.
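Selecting the top tiles amounts to sorting by score and keeping the first N. The following sketch uses a hypothetical ScoredTile record rather than the project's Tile class, with a handful of row, column, and score values taken from the slide #2 results reported in this article.

```python
from dataclasses import dataclass


@dataclass
class ScoredTile:
    """Hypothetical record; the project's own Tile class carries more fields."""
    row: int
    col: int
    score: float


def top_tiles(tiles, n):
    """Return the n highest-scoring tiles, best first."""
    return sorted(tiles, key=lambda t: t.score, reverse=True)[:n]


# A few (row, column, score) values from the slide #2 results in this article.
candidates = [
    ScoredTile(31, 14, 0.8741),
    ScoredTile(35, 37, 0.8816),
    ScoredTile(14, 37, 0.8743),
    ScoredTile(34, 34, 0.8824),
    ScoredTile(9, 44, 0.8797),
]

for rank, t in enumerate(top_tiles(candidates, 3), start=1):
    print(f"#{rank}: row {t.row}, column {t.col}, score {t.score:.4f}")
```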
The score_tile() function assigns a score to a tile based on the tissue percentage and various color characteristics of the tile. The scoring formula utilized by score_tile() combines four terms: tissuepercent, colorfactor, saturationvaluefactor, and quantityfactor.
The scoring formula generates good results for the images in the data set and was developed through experimentation with the training data set. The tissuepercent is emphasized by squaring its value. The colorfactor value is used to weight hematoxylin staining more heavily than eosin staining. Utilizing the HSV color model, broad saturation and value distributions are given more weight by the saturationvaluefactor. The quantityfactor value utilizes the tissue percentage to give more weight to tiles with more tissue. Note that if colorfactor, saturationvaluefactor, or quantityfactor evaluates to 0, the score will be 0. The score is scaled to a value from 0.0 to 1.0.
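To make the structure of the score concrete, here is a minimal sketch that follows the description above: the tissue percentage is squared, the three factors are multiplied together, and the result is squashed into the range 0.0 to 1.0. The logarithmic damping and the constants 1000 and 10 are assumptions chosen for illustration, as are the function and parameter names; consult score_tile() in wsi/tiles.py for the actual formula.

```python
import math


def sketch_score(tissue_percent, color_factor, s_and_v_factor, quantity_factor):
    """Illustrative tile score in [0.0, 1.0); not necessarily the project's formula."""
    combined = color_factor * s_and_v_factor * quantity_factor
    # Squaring emphasizes tissue percentage. The log damping and the divisor
    # are assumptions that keep very large factors from dominating.
    raw = (tissue_percent ** 2) * math.log(1 + combined) / 1000.0
    # Squash the non-negative raw value into the range [0.0, 1.0).
    return 1.0 - 10.0 / (10.0 + raw)


print(sketch_score(100.0, 500.0, 1.0, 1.0))  # high tissue, strong color factor
print(sketch_score(100.0, 0.0, 1.0, 1.0))    # a zero factor gives a score of 0.0
print(sketch_score(20.0, 500.0, 1.0, 1.0))   # less tissue gives a lower score
```

Because the three factors are multiplied before anything else happens, a zero in any of them forces the whole score to zero, which matches the behavior noted above.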
During our discussion of color staining, we mentioned that tissue with hematoxylin staining is most likely preferable to eosin staining. Hematoxylin stains acidic structures such as DNA and RNA with a purple tone, while eosin stains basic structures such as cytoplasm proteins with a pink tone. Let’s discuss how we can more heavily score tiles with hematoxylin staining over eosin staining.
Differentiating purplish shades from pinkish shades can be difficult using the RGB color space (see https://en.wikipedia.org/wiki/RGB_color_space). Therefore, to compute our colorfactor value, we first convert our tile in RGB color space to HSV color space (see https://en.wikipedia.org/wiki/HSL_and_HSV). HSV stands for Hue-Saturation-Value. In this color model, the hue is represented as a degree value on a circle. Purple has a hue of 270 degrees and pink has a hue of 330 degrees. We remove all hues less than 260 and greater than 340. Next, we compute the deviation from purple (270) and the deviation from pink (330). We compute an average factor that is the squared difference of 340 and the hue average. The colorfactor is computed as the pink deviation times the average factor divided by the purple deviation.
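The hue-based computation just described can be sketched in a few lines. This version works on a plain list of hue values in degrees and uses only the standard library. The choice of root-mean-square deviation, the function names, and the handling of the zero-deviation case are assumptions; the text does not specify them.

```python
import math

PURPLE_HUE = 270  # degrees
PINK_HUE = 330    # degrees


def rms_deviation(hues, target):
    """Root-mean-square deviation of the hues from a target hue (an assumed measure)."""
    return math.sqrt(sum((h - target) ** 2 for h in hues) / len(hues))


def color_factor(hues):
    """Illustrative purple-versus-pink factor for a list of hue angles in degrees."""
    kept = [h for h in hues if 260 <= h <= 340]  # drop hues outside 260-340
    if not kept:
        return 0.0  # nothing purplish or pinkish in the tile
    purple_dev = rms_deviation(kept, PURPLE_HUE)
    pink_dev = rms_deviation(kept, PINK_HUE)
    if purple_dev == 0:
        return 0.0  # assumption: sidestep division by zero rather than return infinity
    avg_factor = (340 - sum(kept) / len(kept)) ** 2
    return pink_dev * avg_factor / purple_dev


purplish = [265, 270, 275, 280]
pinkish = [320, 325, 330, 335]
print(color_factor(purplish))  # large: hues sit close to purple
print(color_factor(pinkish))   # small: hues sit close to pink
```

Hues near purple produce a small purple deviation in the denominator and a large average factor, so purplish tiles end up with a much larger value than pinkish ones.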
Let’s have a closer look at a 32×32 tile and its accompanying HSV hue histogram. Note that to properly convert a matplotlib chart image (the histogram) to a NumPy image on macOS, we currently need to include a call to matplotlib.use('Agg'). One way that we can obtain a particular tile for analysis is to call the dynamic_tile() function, which we describe in more detail later. Here, we obtain the tile at the 29th row and 16th column on slide #2. Setting the small_tile_in_tile parameter to True means that the scaled-down 32×32 tile is included in the returned Tile object. The display_image_with_hsv_hue_histogram() function is used to display the small tile and its hue histogram.
# To get around renderer issue on macOS going from Matplotlib image to NumPy image.
import matplotlib
matplotlib.use('Agg')
from deephistopath.wsi import tiles
tile = tiles.dynamic_tile(2, 29, 16, True)
tiles.display_image_with_hsv_hue_histogram(tile.get_np_scaled_tile(), scale_up=True)
Here, we see the 32×32 tile with its accompanying hue histogram. For convenience, colors have been added to the histogram. Also, notice that the non-tissue masked-out pixels have a peak at 0 degrees.
|Tile HSV hue histogram|
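For readers who want to reproduce a hue histogram without the plotting helpers, the sketch below converts RGB pixels to hue angles with the standard library's colorsys module and counts them in 20-degree bins. The function names and the sample pixels are illustrative. It also shows why masked-out (black) pixels pile up at 0 degrees.

```python
import colorsys
from collections import Counter


def hue_degrees(rgb):
    """Convert an (R, G, B) pixel with 0-255 channels to a hue angle in degrees."""
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return round(hue * 360) % 360


def hue_histogram(pixels, bin_width=20):
    """Count pixels per hue bin, keyed by the lower edge of each bin."""
    return Counter((hue_degrees(p) // bin_width) * bin_width for p in pixels)


# Illustrative pixels: two masked-out (black), one purplish, one pinkish.
pixels = [(0, 0, 0), (0, 0, 0), (128, 0, 255), (255, 0, 128)]
for bin_start, count in sorted(hue_histogram(pixels).items()):
    print(f"{bin_start:3d} degrees: {count}")
```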
For convenience, the
Tile class has a
display_with_histograms() function that can be used to display histograms for both the RGB and HSV color spaces. If the scaled-down small tile is included in the Tile object (using the
small_tile_in_tile parameter with a value of
True), histograms will be displayed for both the small tile and the large tile.
import matplotlib
matplotlib.use('Agg')
from deephistopath.wsi import tiles
tile = tiles.dynamic_tile(2, 29, 16, True)
tile.display_with_histograms()
The following image shows RGB and HSV histograms for the scaled-down tile at slide 2, row 29, column 16. We see its score and tissue percentage. This tile’s score was ranked 734 out of a total of 2,508 tiles on this slide.
|Small tile color histograms|
The following image shows RGB and HSV histograms for the full-sized 1,024×1,024 tile at slide 2, row 29, column 16. Notice that the small tile pixels offer a reasonable approximation of the colors present on the large tile. Also, notice that the masked-out pixels in the small tile correspond fairly accurately with the non-tissue regions of the large tile.
|Large tile color histograms|
If the save_data parameter of the summary_and_tiles() function is set to True, detailed data about the slide tiles is saved in .csv format.
tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=True, save_top_tiles=False)
For slide #2, this generates a .csv file containing the tile data.
In addition to the tile tissue heat map summaries, the summary_and_tiles() function generates top tile summaries. By default, it highlights the top 50 scoring tiles. The number of top tiles is configurable.
tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=False, save_top_tiles=False)
The following image shows the top tile summary on the masked image for slide #2. Notice that tiles with high tissue percentages and hematoxylin-stained tissue are favored over tiles with low tissue percentages and eosin-stained tissue. Notice that statistics about the top 50 scoring tiles are displayed to the right of the image.
For visual inspection, the top tile summary is also generated over the original slide image, as we see here.
|Top tiles on original|
When analyzing top tile results, it can be useful to see the tissue percentage heat map of surrounding tiles. This can be accomplished by setting the BORDER_ALL_TILES_IN_TOP_TILE_SUMMARY constant to True. Likewise, it can be useful to see the row and column coordinates of all tiles, which can be accomplished using the LABEL_ALL_TILES_IN_TOP_TILE_SUMMARY constant with a value of True.
|Top tile borders||Top tile labels|
The following image shows a section of a top tile summary that features both the tile tissue heat map and the row and column labels.
|Top tile labels and borders|
Top tile retrieval
Top tiles can be saved as files in batch mode or retrieved dynamically. In batch mode, tiling, scoring, and saving the 1,000 tissue percentage heat map summaries (2 per image), the 1,000 top tile summaries (2 per image), the 2,000 thumbnails, and 25,000 1Kx1K tiles (50 per image) takes approximately 2 hours.
If the save_top_tiles parameter of the summary_and_tiles() function is set to True, the top-ranking tiles for the specified slide will be saved to the file system.
tiles.summary_and_tiles(2, display=True, save_summary=True, save_data=False, save_top_tiles=True)
In general, it is recommended that you use the batch functions in the wsi/tiles.py file, such as multiprocess_filtered_images_to_tiles(). These functions generate convenient HTML pages for investigating the tiles generated for a slide set. The multiprocess_filtered_images_to_tiles() function uses multiprocessing for added performance. If no image_num_list parameter is provided, all images in the data set are processed.
In the following code, we generate the top 50 tiles for slides #1, #2, and #3.
tiles.multiprocess_filtered_images_to_tiles(image_num_list=[1, 2, 3])
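One common way to spread slides across worker processes is to split the slide numbers into contiguous batches, one per worker. The sketch below shows only that partitioning step, under the assumption that each batch would then be handed to a separate process (for example, through multiprocessing.Pool). The function name is hypothetical, and this is not necessarily how multiprocess_filtered_images_to_tiles() divides its work.

```python
def split_into_batches(image_nums, num_workers):
    """Split image numbers into at most num_workers contiguous, near-equal batches."""
    num_workers = max(1, min(num_workers, len(image_nums)))
    base, extra = divmod(len(image_nums), num_workers)
    batches, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)  # spread the remainder
        batches.append(image_nums[start:start + size])
        start += size
    return batches


print(split_into_batches([1, 2, 3], 2))
print(split_into_batches(list(range(1, 11)), 4))
```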
On the generated tiles.html page, we see the original slide images, the images after filtering, the tissue percentage heat map summaries on the filtered images and the original images, tile summary data including links to the generated .csv file for each slide, the top tile summaries on the filtered images and the original images, and links to the top 50 tile files for each slide.
The full-size 1,024×1,024 tiles can be investigated using the top tile links. In the following images, we see the two top-scoring tiles on slide 2 at row 34, column 34 and row 35, column 37.
|Slide #2, top tile #1||Slide #2, top tile #2|
Tiles can also be retrieved dynamically. In dynamic tile retrieval, slides are scaled down, filtered, tiled, and scored all in-memory. The top tiles can then be retrieved from the original WSI file and stored in-memory. No intermediate files are written to the file system during dynamic tile retrieval.
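Retrieving a tile from the original WSI file requires translating its row and column into pixel coordinates on the full-size slide. The sketch below shows one simple way to do that, assuming 1-based row and column numbers, 1,024-pixel tiles, and edge tiles clipped to the slide bounds. The function name is hypothetical, and the project's own mapping may differ in rounding details.

```python
TILE_SIZE = 1024  # tile edge length in pixels of the original slide


def tile_region(row, col, slide_width, slide_height, tile_size=TILE_SIZE):
    """Return (x, y, width, height) of a 1-based (row, col) tile on the original slide."""
    x = (col - 1) * tile_size
    y = (row - 1) * tile_size
    width = min(tile_size, slide_width - x)    # clip at the right edge
    height = min(tile_size, slide_height - y)  # clip at the bottom edge
    return x, y, width, height


# Slide #2 has dimensions 57,922 x 44,329 (width x height).
print(tile_region(1, 1, 57922, 44329))
print(tile_region(34, 34, 57922, 44329))
print(tile_region(44, 57, 57922, 44329))  # bottom-right tile is clipped
```

A region expressed this way could then be passed to a slide-reading library to load only the pixels for that tile.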
In the following code, we dynamically obtain a TileSummary object by calling dynamic_tiles() for slide #2. We obtain the top-scoring tiles from tile_summary, outputting status information about each tile. The status information includes the tile number, the row number, the column number, the tissue percentage, and the tile score.
tile_summary = tiles.dynamic_tiles(2)
top_tiles = tile_summary.top_tiles()
for t in top_tiles:
    print(t)
In the console output, we see that the original .svs file is opened, the slide is scaled down, and our series of filters is run on the scaled-down image. After that, the tiles are scored, and we see status information about the top 50 tiles for the slide.
Opening Slide #2: ../data/training_slides/TUPAC-TR-002.svs
RGB | Time: 0:00:00.007339 Type: uint8 Shape: (1385, 1810, 3)
Filter Green Channel | Time: 0:00:00.005135 Type: bool Shape: (1385, 1810)
Mask RGB | Time: 0:00:00.007973 Type: uint8 Shape: (1385, 1810, 3)
Filter Grays | Time: 0:00:00.073780 Type: bool Shape: (1385, 1810)
Mask RGB | Time: 0:00:00.008114 Type: uint8 Shape: (1385, 1810, 3)
Filter Red Pen | Time: 0:00:00.066007 Type: bool Shape: (1385, 1810)
Mask RGB | Time: 0:00:00.007925 Type: uint8 Shape: (1385, 1810, 3)
Filter Green Pen | Time: 0:00:00.105854 Type: bool Shape: (1385, 1810)
Mask RGB | Time: 0:00:00.008034 Type: uint8 Shape: (1385, 1810, 3)
Filter Blue Pen | Time: 0:00:00.087092 Type: bool Shape: (1385, 1810)
Mask RGB | Time: 0:00:00.007963 Type: uint8 Shape: (1385, 1810, 3)
Mask RGB | Time: 0:00:00.007807 Type: uint8 Shape: (1385, 1810, 3)
Remove Small Objs | Time: 0:00:00.034308 Type: bool Shape: (1385, 1810)
Mask RGB | Time: 0:00:00.007814 Type: uint8 Shape: (1385, 1810, 3)
[Tile #1915, Row #34, Column #34, Tissue 100.00%, Score 0.8824]
[Tile #1975, Row #35, Column #37, Tissue 100.00%, Score 0.8816]
[Tile #1974, Row #35, Column #36, Tissue 99.90%, Score 0.8811]
[Tile #500, Row #9, Column #44, Tissue 99.32%, Score 0.8797]
[Tile #814, Row #15, Column #16, Tissue 99.22%, Score 0.8795]
[Tile #1916, Row #34, Column #35, Tissue 100.00%, Score 0.8789]
[Tile #1956, Row #35, Column #18, Tissue 99.51%, Score 0.8784]
[Tile #1667, Row #30, Column #14, Tissue 98.63%, Score 0.8783]
[Tile #1839, Row #33, Column #15, Tissue 99.51%, Score 0.8782]
[Tile #1725, Row #31, Column #15, Tissue 99.61%, Score 0.8781]
[Tile #2061, Row #37, Column #9, Tissue 98.54%, Score 0.8779]
[Tile #724, Row #13, Column #40, Tissue 99.90%, Score 0.8778]
[Tile #1840, Row #33, Column #16, Tissue 99.22%, Score 0.8777]
[Tile #758, Row #14, Column #17, Tissue 99.41%, Score 0.8775]
[Tile #1722, Row #31, Column #12, Tissue 98.24%, Score 0.8771]
[Tile #722, Row #13, Column #38, Tissue 99.51%, Score 0.8769]
[Tile #1803, Row #32, Column #36, Tissue 99.22%, Score 0.8769]
[Tile #446, Row #8, Column #47, Tissue 100.00%, Score 0.8768]
[Tile #988, Row #18, Column #19, Tissue 99.61%, Score 0.8767]
[Tile #2135, Row #38, Column #26, Tissue 99.80%, Score 0.8767]
[Tile #704, Row #13, Column #20, Tissue 99.61%, Score 0.8767]
[Tile #816, Row #15, Column #18, Tissue 99.41%, Score 0.8766]
[Tile #1180, Row #21, Column #40, Tissue 99.90%, Score 0.8765]
[Tile #1178, Row #21, Column #38, Tissue 99.80%, Score 0.8765]
[Tile #1042, Row #19, Column #16, Tissue 99.71%, Score 0.8764]
[Tile #1783, Row #32, Column #16, Tissue 99.80%, Score 0.8764]
[Tile #1978, Row #35, Column #40, Tissue 100.00%, Score 0.8763]
[Tile #832, Row #15, Column #34, Tissue 99.61%, Score 0.8762]
[Tile #1901, Row #34, Column #20, Tissue 99.90%, Score 0.8759]
[Tile #701, Row #13, Column #17, Tissue 99.80%, Score 0.8758]
[Tile #817, Row #15, Column #19, Tissue 99.32%, Score 0.8757]
[Tile #2023, Row #36, Column #28, Tissue 100.00%, Score 0.8754]
[Tile #775, Row #14, Column #34, Tissue 99.51%, Score 0.8754]
[Tile #1592, Row #28, Column #53, Tissue 100.00%, Score 0.8753]
[Tile #702, Row #13, Column #18, Tissue 99.22%, Score 0.8753]
[Tile #759, Row #14, Column #18, Tissue 99.51%, Score 0.8752]
[Tile #1117, Row #20, Column #34, Tissue 99.90%, Score 0.8751]
[Tile #1907, Row #34, Column #26, Tissue 99.32%, Score 0.8750]
[Tile #1781, Row #32, Column #14, Tissue 99.61%, Score 0.8749]
[Tile #2250, Row #40, Column #27, Tissue 99.61%, Score 0.8749]
[Tile #1902, Row #34, Column #21, Tissue 99.90%, Score 0.8749]
[Tile #2014, Row #36, Column #19, Tissue 99.22%, Score 0.8749]
[Tile #2013, Row #36, Column #18, Tissue 99.51%, Score 0.8747]
[Tile #1175, Row #21, Column #35, Tissue 99.71%, Score 0.8746]
[Tile #760, Row #14, Column #19, Tissue 99.22%, Score 0.8746]
[Tile #779, Row #14, Column #38, Tissue 99.32%, Score 0.8745]
[Tile #1863, Row #33, Column #39, Tissue 99.71%, Score 0.8745]
[Tile #1899, Row #34, Column #18, Tissue 99.51%, Score 0.8745]
[Tile #778, Row #14, Column #37, Tissue 99.90%, Score 0.8743]
[Tile #1724, Row #31, Column #14, Tissue 99.51%, Score 0.8741]
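The tile numbers in this output are consistent with row-major numbering over a 57-column grid, with rows, columns, and tile numbers all starting at 1. That relationship is inferred from the output rather than stated by the project, and the function names below are hypothetical.

```python
NUM_COLS = 57  # columns in the tile grid for slide #2


def tile_number(row, col, num_cols=NUM_COLS):
    """Row-major tile number for a 1-based (row, col) position."""
    return (row - 1) * num_cols + col


def tile_position(tile_num, num_cols=NUM_COLS):
    """Inverse of tile_number(): return the 1-based (row, col) for a tile number."""
    row, col = divmod(tile_num - 1, num_cols)
    return row + 1, col + 1


# Spot-check against entries in the console output above.
for row, col in [(34, 34), (35, 37), (9, 44), (40, 27)]:
    print(f"Row #{row}, Column #{col} -> Tile #{tile_number(row, col)}")
```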
If we’d like to obtain each tile as a NumPy array, we can do so by calling the get_np_tile() function on the Tile object.
tile_summary = tiles.dynamic_tiles(2)
top_tiles = tile_summary.top_tiles()
for t in top_tiles:
    print(t)
    np_tile = t.get_np_tile()
As a further example, in the following code, we dynamically retrieve the tiles for slide #4 and display the top 2 tiles along with their RGB and HSV histograms.
tile_summary = tiles.dynamic_tiles(4)
top = tile_summary.top_tiles()[:2]
for t in top:
    t.display_with_histograms()
|Slide #4, top tile #1||Slide #4, top tile #2|
Next, we dynamically retrieve the tiles for slide #2. We display (not shown) the tile tissue heat map and top tile summaries and then obtain the tiles ordered by tissue percentage. We display the 1,000th and 1,500th tiles by tissue percentage.
tile_summary = tiles.dynamic_tiles(2)
tile_summary.display_summaries()
ts = tile_summary.tiles_by_tissue_percentage()
# ts is a list, so zero-based indexes 999 and 1499 select the 1,000th and 1,500th tiles.
ts[999].display_with_histograms()
ts[1499].display_with_histograms()
In the following images, we see the 1,000th and 1,500th tiles ordered by tissue percentage for slide #2. Note that the displayed tile rank information is based on score rather than tissue percentage alone.
|Slide #2, tissue percentage #1000||Slide #2, tissue percentage #1500|
Tiles can be retrieved based on position. In the following code, we display the tiles at row 25, column 30 and row 25, column 31 on slide #2.
tile_summary = tiles.dynamic_tiles(2)
tile_summary.get_tile(25, 30).display_tile()
tile_summary.get_tile(25, 31).display_tile()
|Slide #2, row #25, column #30||Slide #2, row #25, column #31|
If an individual tile is required, the
dynamic_tile() function can be used.
tiles.dynamic_tile(2, 25, 32).display_tile()
|Slide #2, row #25, column #32|
If multiple tiles need to be retrieved dynamically, for performance reasons dynamic_tiles() is preferable to repeated calls to dynamic_tile().
In this article series, we’ve taken a look at how Python, in particular with packages such as NumPy and scikit-image, can be used for tissue segmentation in whole-slide images. To efficiently process images in our data set, we utilized OpenSlide to scale down the slides. Using NumPy arrays, we investigated a wide variety of image filters and settled on a combination and series of filters that demonstrated fast, acceptably accurate tissue segmentation for our data set. Following this, we divided the filtered images into tiles and scored the tiles based on tissue percentage and color characteristics such as the degree of hematoxylin staining versus eosin staining. We then demonstrated how we can retrieve the top-scoring tiles that have high tissue percentages and preferred staining characteristics. We saw how whole-slide images could be processed in batches or dynamically. Scaling, filtering, tiling, scoring, and saving the top tiles can be accomplished in batch mode using multiprocessing in the following manner.
slide.multiprocess_training_slides_to_images()
filter.multiprocess_apply_filters_to_images()
tiles.multiprocess_filtered_images_to_tiles()
The previous code generates HTML filter and tile pages that simplify visual inspection of the image processing and the final tile results. Because the average number of pixels per whole-slide image is 7,670,709,629 and we have reduced the data to the top 50 1,024×1,024 pixel tiles, we have reduced the raw image data by a factor of about 146 while identifying tiles that have significant potential for further useful analysis.
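The reduction factor quoted above follows directly from the pixel counts, as this short calculation shows (variable names are illustrative).

```python
AVG_PIXELS_PER_SLIDE = 7_670_709_629  # average pixels per whole-slide image
NUM_TOP_TILES = 50
TILE_SIZE = 1024

top_tile_pixels = NUM_TOP_TILES * TILE_SIZE * TILE_SIZE
reduction_factor = AVG_PIXELS_PER_SLIDE / top_tile_pixels

print(f"Top tile pixels per slide: {top_tile_pixels:,}")
print(f"Reduction factor: about {reduction_factor:.0f}x")
```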