1 Introduction
Prostate cancer (PCa) is currently the second most common cancer among men in America. Early detection allows for more treatment options and a greater chance of treatment success, but while there are several methods of initial screening, a definitive diagnosis of PCa can only be made with a prostate biopsy [Faraj2016]. Tissue samples are currently recorded as high-resolution images, called whole-slide images (WSIs). In these images pathologists analyze the alterations in the stroma and glandular units and, using the Gleason score (GS) system, classify prostate cancer into five progressive levels, from 6 to 10 [Li2020]. The higher the grade, the more advanced the cancer. The analysis is mostly a manual task and requires specialized urological pathologists. Such specialized staff are not always available, especially in developing countries, and the process is subject to great inter-observer variability [Strom2020]. Therefore, several efforts have been made to develop computer-assisted diagnosis systems that may facilitate the work of specialists [Bulten2020].
Deep convolutional neural networks (CNNs) represent the state of the art in the analysis of visual information, and their use in automatic classification models for medical images has been widely studied. However, much research remains to be done on the diagnostic process in histopathology [Strom2020]. One of the main problems facing the application of deep learning to medical problems is the limited availability of large databases, given the standards required for the successful training of deep learning models. For histopathology, previous studies have been limited to very small data sets or to subsets of Gleason patterns [Strom2020]. In addition, deep learning models usually approach the prostate tissue grading task as a multi-class or even a binary classification of low-risk (GS 6–7) vs. high-risk (GS 8–10) cases [Lara2020]. This has two drawbacks: first, the ordinal information of the grades is not taken into account. Second, the model predictions, usually subject to a softmax activation function, cannot be interpreted as a probability distribution [vaicenavicius19a], and therefore give no information about the uncertainty of the predictions, which, in safety-critical applications, provides the method with a first level of interpretability.

In this paper we approach prostate tissue grading as an ordinal regression task. We present the Deep Quantum Measurement Ordinal Regression (DQMOR), a deep probabilistic model that combines a CNN with a differentiable probabilistic regression model, the Quantum Measurement Regression (QMR) [Gonzalez2021]. This approach allows us to:

Predict posterior probability distributions over the range of grades. Unlike those of other probabilistic methods such as Gaussian processes, these are explicit discrete distributions.

Integrate patch-level posterior distributions into a single whole-slide image distribution in a simple, yet powerful probability-based manner.

Quantify the uncertainty of the predictions. This enriches the model as a diagnostic support tool, providing it with a first level of interaction and interpretability of the results.
In order to validate our approach, we compare our performance with state-of-the-art deep learning-based methods [Lara2020], and with closely related classification and regression methods such as Density Matrix Kernel Density Classification (DMKDC) [Gonzalez2021] and Gaussian processes [Cutajar2017, Rasmussen2006].
The paper is organized as follows: Section 2 presents a brief overview of related work. Section 3 presents the theoretical framework of DQMOR, and Section 4 presents the experimental setup and results. Finally, in Section 5 we present the conclusions of this work.
2 Related Work
Classification of prostate cancer images by GS is considered a difficult task even among pathologists, who often disagree in their judgments. In recent years, there has been a great research effort to automatically classify PCa. However, most previous works focus on classifying prostate WSIs between low and high GS, ignoring the inherent ordinal characteristics of the grading system.
To train a CNN with WSIs, each image must be divided into multiple patches, and the information of the patches must then be summarized by some method to obtain a prediction for the WSI. In [JimenezdelToro2017], the authors classify patches between low and high GS, utilizing various CNN architectures and summarizing the patches to a WSI by a GS majority vote. Another approach, by Tolkach et al. [Tolkach2020], uses a NASNetLarge CNN and summarizes the GS of the patches by counting the probabilities per class. Karimi et al. [Karimi2020] proposed training three CNNs for patches of different sizes and summarizing the probabilities by a logistic regression. In [Esteban2019], the authors use Gaussian processes based on granulometry descriptors extracted with a CNN for the binary classification task. Other CNN architectures for GS grading include a combination of an atrous spatial pyramid pooling and a regular CNN as in [Li2020], an Inception-v3 CNN with a support vector machine (SVM) as in [Lucas2019], and a DeepLabV3+ with a MobileNet backbone [Khani2019].

3 Deep Quantum Measurement Ordinal Regression
The overall architecture of the proposed Deep Quantum Measurement Ordinal Regression (DQMOR) is described in Figure 1. We use an Xception CNN [chollet2017] as a patch-level feature extractor. The extracted features are then used as inputs for the QMR method [Gonzalez2021]. QMR requires an additional feature mapping of the inputs to obtain a quantum state-like representation; this is done by means of a random Fourier features approach [Rahimi2009RandomMachines]. The regressor yields a discrete posterior probability distribution at patch level. Then, to predict the GS of a single WSI, we summarize the results of the patches into a single posterior distribution from which we obtain the final grade and an uncertainty measure.
3.1 Feature extraction
We choose as feature extractor the model presented in [Lara2020], which is publicly available and consists of an Xception network trained on ImageNet and fine-tuned on prostate tissue image patches. This network was originally used in an automatic information fusion model for the binary (low–high) classification of WSIs. Taking the output of the last average pooling layer of the model, we obtain a 2048-dimensional vector representing each image patch.
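As an illustration, the pooling step that collapses the backbone's final activation maps into the 2048-dimensional patch descriptor can be sketched in NumPy as follows; the 10×10 spatial size of the activation map is hypothetical, not the network's actual one:

```python
import numpy as np

# Minimal sketch of global average pooling: the backbone's last convolutional
# block yields a (height, width, 2048) activation map per patch, and averaging
# over the spatial dimensions produces the 2048-d descriptor used downstream.
def global_average_pool(activations: np.ndarray) -> np.ndarray:
    # activations: (batch, height, width, channels)
    return activations.mean(axis=(1, 2))

batch = np.random.rand(4, 10, 10, 2048)      # four dummy patch activation maps
descriptors = global_average_pool(batch)
print(descriptors.shape)                     # (4, 2048)
```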
3.2 Quantum Measurement Regression
QMR addresses the question of how to use density matrices for regression problems, using random features to encode the inputs in a quantum state-like representation. The model works as a non-parametric density estimator [Gonzalez2021].
3.2.1 Random Fourier Features (RFF)
The RFF method [Rahimi2009RandomMachines] creates a feature map of the data in which the dot product of the samples in the new space approximates a shift-invariant kernel $k(x - y)$. The method works by sampling the projection weights i.i.d. from a probability distribution given by the Fourier transform of $k$, and sampling the biases i.i.d. from a uniform distribution on $[0, 2\pi]$. In our context, the shift-invariant kernel is the radial basis function (RBF) kernel, $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, where $\gamma$ and the number of RFF components are hyperparameters of the model. In our model the RFF work as an embedding layer that maps the features from the Xception module to a representation space suitable for the quantum measurement regression layer.

3.2.2 Quantum Measurement Regression (QMR)
QMR [Gonzalez2021] is a differentiable probabilistic regression model that uses a density matrix, $\rho_{\mathrm{train}}$, to represent the joint probability distribution of inputs and labels. A QMR layer receives an RFF-encoded input sample $z(x)$ and builds a prediction operator $\pi = z(x)z(x)^{T} \otimes \mathrm{Id}_{\mathcal{H}_Y}$, where $\mathrm{Id}_{\mathcal{H}_Y}$ is the identity operator in $\mathcal{H}_Y$, the representation space of the labels. Inference is made by performing a measurement on the training density matrix $\rho_{\mathrm{train}}$:

$$\rho = \frac{\pi \, \rho_{\mathrm{train}} \, \pi}{\mathrm{Tr}[\pi \, \rho_{\mathrm{train}} \, \pi]}. \qquad (1)$$
Then a partial trace $\rho_Y = \mathrm{Tr}_X[\rho]$ is calculated, which encodes in its diagonal elements $(\rho_Y)_{ii}$, with $i \in \{6, \dots, 10\}$, the posterior probability over the labels. The expected value $\hat{y} = \sum_i i \, (\rho_Y)_{ii}$ represents the final prediction.
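A minimal NumPy sketch of this inference step, assuming the density matrix is stored in factorized form (all dimensions, names, and values here are illustrative, not the trained model's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input features, RFF components, grades (GS 6..10),
# and the number of eigen-components of the factorized density matrix.
D_IN, D_RFF, N_GRADES, N_EIG = 2048, 64, 5, 8
GAMMA = 1e-4

# Random Fourier Features approximating the RBF kernel exp(-gamma ||x - y||^2).
W = rng.normal(0.0, np.sqrt(2.0 * GAMMA), size=(D_RFF, D_IN))
b = rng.uniform(0.0, 2.0 * np.pi, size=D_RFF)

def rff_encode(x):
    z = np.sqrt(2.0 / D_RFF) * np.cos(W @ x + b)
    return z / np.linalg.norm(z)           # normalize to a unit "quantum state"

# A toy density matrix rho = sum_r lam_r v_r v_r^T, with each eigen-component
# v_r reshaped as a (D_RFF, N_GRADES) matrix over inputs x labels.
lam = rng.random(N_EIG)
lam /= lam.sum()
V = rng.normal(size=(N_EIG, D_RFF, N_GRADES))

def qmr_posterior(x):
    """Measurement on rho followed by a partial trace over the input space:
    p(y = i | x) is proportional to sum_r lam_r (V_r^T z)_i^2."""
    z = rff_encode(x)
    proj = np.einsum("rdk,d->rk", V, z)    # (V_r^T z) for every eigen-component
    scores = (lam[:, None] * proj**2).sum(axis=0)
    return scores / scores.sum()

x = rng.normal(size=D_IN)                  # a dummy 2048-d Xception descriptor
posterior = qmr_posterior(x)               # discrete distribution over GS 6..10
grades = np.arange(6, 11)
y_hat = (grades * posterior).sum()         # expected value = final prediction
print(posterior.round(3), y_hat)
```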
A gradient-based optimization is made possible by a spectral decomposition of the density matrix, $\rho_{\mathrm{train}} = V^{T} \Lambda V$, in which the number of eigen-components of the factorization is a hyperparameter of the model. The model is trained by minimizing a mean-squared-error loss function with a variance term whose relative importance is controlled by the hyperparameter $\alpha$:

$$L = \sum_{k} \left( y_k - \hat{y}_k \right)^2 + \alpha \, \sigma_{k}^{2}, \qquad (2)$$

where $\sigma_{k}^{2}$ is the variance of the predicted distribution for sample $k$.
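A sketch of this loss, under the assumption that the variance term is the variance of each predicted discrete distribution over the grades (function and variable names are ours):

```python
import numpy as np

grades = np.arange(6, 11)

def dqmor_loss(posteriors, y_true, alpha=1.0):
    """Mean-squared error plus an alpha-weighted variance penalty (Eq. (2)).
    `posteriors` is an (n, 5) array of predicted distributions over GS 6..10."""
    y_hat = posteriors @ grades                                   # expected values
    var = (posteriors * (grades[None, :] - y_hat[:, None])**2).sum(axis=1)
    return ((y_true - y_hat)**2 + alpha * var).sum()

# A perfectly confident, correct prediction incurs zero loss:
p = np.zeros((1, 5))
p[0, 2] = 1.0                              # all mass on grade 8
print(dqmor_loss(p, np.array([8.0])))      # 0.0
```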
3.3 WSIs predictions
Since the training of the model is performed at patch level, the evaluation can be done both at that level and at the level of the WSI. To get a prediction for a whole image, we explored two approaches: a majority vote procedure (MV) and a probability vote procedure (PV). In the majority vote, the prediction for an image is the grade with the highest number of predictions among the patches of the image. In the probability vote, since each patch is associated with a probability distribution, the normalized summation yields a distribution for the whole image. More formally, by the law of total probability, given an image $I$ composed of $N$ patches, each denoted by $x_j$, the posterior probability of grade $g$ is

$$p(y = g \mid I) = \frac{1}{N} \sum_{j=1}^{N} p(y = g \mid x_j). \qquad (3)$$
The final prediction thus corresponds to the grade with the highest probability. In the DQMOR method one could predict the expected value of the distribution; instead, predictions at patch level were deduced from the probability of each grade per patch, and at WSI level by MV and PV.
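Both aggregation schemes can be sketched as follows (dummy three-patch posteriors over the five grades; names are illustrative):

```python
import numpy as np

def majority_vote(patch_posteriors, grades=np.arange(6, 11)):
    # Each patch votes for its most probable grade; the most frequent grade wins.
    votes = grades[np.argmax(patch_posteriors, axis=1)]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

def probability_vote(patch_posteriors, grades=np.arange(6, 11)):
    # Eq. (3): average the patch posteriors into a single WSI distribution,
    # then pick the grade with the highest probability.
    wsi_posterior = patch_posteriors.mean(axis=0)
    return grades[np.argmax(wsi_posterior)], wsi_posterior

patches = np.array([[0.1, 0.6, 0.2, 0.1, 0.0],    # three dummy patch posteriors
                    [0.0, 0.3, 0.5, 0.2, 0.0],
                    [0.1, 0.5, 0.3, 0.1, 0.0]])
print(majority_vote(patches))          # 7
print(probability_vote(patches)[0])    # 7
```

Note that PV retains the full WSI distribution, which is what enables the uncertainty quantification discussed below, while MV discards it.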
4 Experimental Evaluation
4.1 Dataset
We use images from the TCGA-PRAD data set, which contains samples of prostate tissue with GS from 6 to 10. This data set is publicly available via The Cancer Genome Atlas (TCGA) [JimenezdelToro2017]. In order to directly compare our results with our baseline [Lara2020], we use the same subset and partition, consisting of 141 cases for training, 48 for validation, and 46 for testing.
4.2 Experimental Set Up
The feature extraction model is publicly available, and the augmentation procedure and training details are described in [Lara2020]. For the QMR, hyperparameter tuning was performed by generating 25 different random configurations. As a result, we created an embedding with 1024 RFF components and 32 eigenvalues, and $\gamma$ was set to . For the loss function (see Eq. (2)), $\alpha$ was set at , and we set a learning rate of .

Two extensions of the feature extractor model were set up as baselines for this work. The first dense-layer classifier (DLC1) consisted of 1024 neurons with ReLU as the activation function and a dropout of , followed by 5 neurons with a softmax activation function for the output; the learning rate was set to , as in the baseline [Lara2020]. The second classifier (DLC2) had two dense layers of and neurons with ReLU activation functions and dropouts of , connected to 5 neurons with a softmax activation function; the learning rate was set to .

We also explored two methods closely related to QMR: Density Matrix Kernel Density Classification (DMKDC) [Gonzalez2021] and Gaussian processes. DMKDC is a differentiable classification method which applies an RFF feature map to the input sample and then computes the expected value of the input with a density matrix of each class, returning a posterior probability distribution that can be optimized with a categorical cross-entropy loss function. As with QMR, a hyperparameter random search was performed. We created an embedding with 1024 RFF components and 32 eigenvalues; $\gamma$ was set at , and we set a learning rate of . All the previous experiments were performed in Python using the publicly available Keras-based implementation presented in [Gonzalez2021].

On the other hand, Gaussian processes (GP) [Rasmussen2006] are another powerful Bayesian approach to regression problems. By means of a kernel covariance matrix, a GP iteratively calculates and updates the probability distribution over the functions that fit the data, optimizing the kernel parameters in the process. In our case we set the kernel as the RBF. The prediction process consists in marginalizing the learned Gaussian distribution, whose mean is the actual prediction value and whose standard deviation is an uncertainty indicator. We performed experiments with GP using the Scikit-Learn implementation in Python. We also explored deep Gaussian processes (DGP), using the implementation proposed in [Cutajar2017], which also uses RFF to approximate the covariance function. For those experiments, another hyperparameter random search was made, finally setting the number of RFF at 1024 and the learning rate at , in a single-layer schema.

4.3 Results and Discussion
4.3.1 Ordinal Regression
Measuring the performance of an ordinal regression method implies taking into account the severity of the misclassifications. Therefore, in addition to accuracy (ACC) and macro F1 score, we also measured the mean absolute error (MAE) on the test partition, at patch level and at WSI level. WSI scores were summarized by MV and PV. The prediction methods at WSI level were also applied to the baseline models: in the dense-layer classifiers, from the softmax output, as in [Tolkach2020]; in DMKDC, they were directly applicable because the model outputs a probability distribution. For GP and DGP only MV was computed, since we have no access to an explicit discrete posterior distribution. The results are reported in Table 1 and Table 2.
Table 1: Patch-level results.

Method | Accuracy | Macro F1 | MAE
DLC1 [Lara2020] | 0.314 | – | –
DLC2 [Lara2020] | – | – | –
GP [Rasmussen2006] | – | – | –
DGP [Cutajar2017] | – | – | –
DMKDC [Gonzalez2021] | 0.546 | – | –
DQMOR | – | – | 0.732
In terms of accuracy at patch level, the DMKDC model obtained the highest results. The best accuracy at WSI level was reached by DQMOR with probability vote. DQMOR also obtained the lowest mean absolute error at both patch and WSI levels, showing that the model takes advantage of the probability distributions and the inherent ordinality of the GS grades.
Table 2: WSI-level results.

Method | Accuracy | Macro F1 | MAE
DLC1 MV [Lara2020] | – | – | –
DLC2 MV [Lara2020] | 0.548 | – | –
GP MV [Rasmussen2006] | – | – | –
DGP MV [Cutajar2017] | – | – | –
DMKDC MV [Gonzalez2021] | – | – | –
DQMOR MV | – | 0.306 | 0.713
DLC1 PV [Lara2020] | – | – | –
DLC2 PV [Lara2020] | – | – | –
DMKDC PV [Gonzalez2021] | – | – | –
DQMOR PV | 0.567 | 0.345 | 0.730
4.3.2 Uncertainty Quantification
Beyond the classification and regression performance of the methods, DQMOR allows an uncertainty quantification based on the variance of the predicted distribution. We analyzed the statistical behaviour of the predicted variance on the test set at WSI level, grouping the samples according to the absolute error. As expected, DQMOR predicts low uncertainty levels on well-classified samples when compared with the misclassified samples (see Figure 2). In fact, the greater the absolute error, the greater the uncertainty. This attribute provides the method with an interpretable means for the specialist, who may decide whether or not to trust the model prediction.
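As an illustration, the uncertainty score is simply the variance of the discrete WSI posterior: a distribution concentrated on one grade scores lower than a spread-out one (toy distributions over grades 6–10; names are ours):

```python
import numpy as np

grades = np.arange(6, 11)

def predictive_variance(posterior):
    """Variance of the discrete posterior over GS 6..10, used as uncertainty."""
    mean = (grades * posterior).sum()
    return ((grades - mean)**2 * posterior).sum()

confident = np.array([0.02, 0.90, 0.05, 0.02, 0.01])  # mass concentrated on GS 7
uncertain = np.array([0.20, 0.25, 0.20, 0.20, 0.15])  # mass spread across grades
print(predictive_variance(confident) < predictive_variance(uncertain))  # True
```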
5 Conclusions
In this work we approached the prostate tissue grading as an ordinal regression task. We combined the representational power of deep learning with the Quantum Measurement Regression method, which uses density matrices and random features to build a nonparametric density estimator.
The results on classification and regression metrics show that at WSI level, DQMOR outperforms similar probabilistic classification and regression methods, as well as extensions of the deep base model used for feature extraction. Regarding the analysis of the predicted uncertainty, we showed that DQMOR allows the identification of misclassified examples, and that the higher the misclassification error, the higher the uncertainty. This is a highly valued ability in medical applications, where the aim is to prevent false positives and especially false negatives in diagnostic processes.
Overall, we demonstrate that, unlike single deep learning architectures and standard classification models, the combination of CNNs and QMR allows us to use the ordinal information of the disease grades, and provides a better theoretical framework for combining patch-level inference into a single WSI prediction.