One clear application of these ML- and DL-based methodologies is to replicate the tasks of pathologists. A well-trained model has the benefits of eliminating interobserver variability amongst pathologists and of achieving a level of throughput impossible for humans. This section covers research in classification for both cancer and non-cancer pathologies and in segmentation tasks aimed at identifying structures or cell types within images.
The most popular histopathological application of these methods in gastroenterology and hepatology occurs in the classification of cancers. The general concept is that if these classifications can be made with the human eye, then the models should be able to learn to make such distinctions themselves.
Colorectal: Thakur et al recently published a comprehensive review of artificial intelligence applications in colorectal cancer pathology image analysis, but several papers will still be highlighted in this review.
In an earlier study, Yoon et al trained a customized VGG-based network architecture on 28 normal and 29 colorectal cancer HE-stained slides that were tiled into 256 × 256 pixel patches. After testing several custom VGG-based networks, the best model had an accuracy of 93.5% with a sensitivity and specificity of about 95% and 93%, respectively, in determining if an image patch was cancer vs healthy. This study showed promise in the relatively simpler binary classification task of tumor vs normal.
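The tiling step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `tile_coordinates`, the non-overlapping stride, and the choice to drop partial edge patches are all assumptions made for the example.

```python
def tile_coordinates(width, height, patch=256, stride=256):
    """Return (x, y) top-left corners of every full patch that fits in the image.

    Assumption: partial patches at the right and bottom edges are discarded.
    """
    coords = []
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            coords.append((x, y))
    return coords


# A hypothetical 1024 x 600 pixel region yields 4 columns x 2 rows of patches.
coords = tile_coordinates(1024, 600)
print(len(coords))  # 8
```

Setting `stride` below `patch` would produce overlapping patches, which some of the studies discussed later rely on.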
In a study published since, Sena et al took the classification task another step further to train a model to classify between normal mucosa, early neoplastic lesion, adenoma, and cancer in HE-stained samples. The authors used a custom network architecture similar to VGG with four sequential convolutional and pooling layers followed by dense layers. Even with this relatively simple network architecture, the model achieved about a 95% accuracy in predicting the exact label for its larger 864x648 pixel patches across the four classes.
While the above patch-level performances are encouraging, clinical diagnoses are typically made at the slide level. To address this, researchers often train additional classifiers on top of the patch-level ones that make a WSI-level prediction by aggregating patch-level information. An example of this is the study by Iizuka et al, which initially trained the Inception-v3 network to classify between non-neoplastic, adenoma, and adenocarcinoma for 512 × 512 pixel patches from HE-stained colorectal and gastric biopsy WSIs. The authors utilized the trained Inception-v3 classifier as a feature extractor to generate 715-length feature vectors, the representations, for each patch. The sequence of feature vectors from every patch in a WSI is used as the input, and the WSI label as the output, in training a subsequent RNN. Though RNNs are typically used with temporal data, they have the advantage of accepting variable-length, sequential inputs when generating final output labels. This is an important feature considering that every WSI has a varying number of total patches sampled. The trained RNN can thus predict the WSI diagnosis by aggregating extracted feature vectors of all patches in that sample. The study achieved WSI-level prediction AUCs of up to 0.97 and 0.99 for gastric adenocarcinoma and adenoma, respectively, and 0.96 and 0.99 for colonic adenocarcinoma and adenoma, respectively. Of note, the gastric classifier model outperformed pathologists in classification accuracy when pathologists were given a 30 s time limit, which is the average amount of time the model takes per WSI. The gastric model achieved an accuracy of 95.6% compared to the 85.89% ± 1.401% (n = 23) for the pathologists.
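To illustrate why a recurrent network suits slides with differing patch counts, the sketch below runs a toy recurrent aggregator over two hypothetical slides. It is not the network used by Iizuka et al: the hidden size, the random untrained weights, the 3-length feature vectors, and the single sigmoid output are all assumptions chosen to keep the example small.

```python
import math
import random


def rnn_aggregate(patch_features, W_x, W_h, w_out):
    """Fold a variable-length sequence of patch vectors into one slide-level score."""
    hidden = [0.0] * len(W_h)
    for x in patch_features:
        # h_t = tanh(W_x x_t + W_h h_{t-1}); the same weights are reused at every step.
        hidden = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * hidden[k] for k in range(len(hidden)))
            )
            for i in range(len(W_h))
        ]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid gives a slide-level probability


rng = random.Random(0)
feature_dim, hidden_dim = 3, 4  # toy sizes; real feature vectors are much longer
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(feature_dim)] for _ in range(hidden_dim)]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

short_slide = [[rng.random() for _ in range(feature_dim)] for _ in range(3)]
long_slide = [[rng.random() for _ in range(feature_dim)] for _ in range(7)]

# The same weights handle a 3-patch and a 7-patch slide without any padding.
p_short = rnn_aggregate(short_slide, W_x, W_h, w_out)
p_long = rnn_aggregate(long_slide, W_x, W_h, w_out)
print(p_short, p_long)
```

In practice the weights would be learned from WSI-level labels rather than drawn at random.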
Xu et al utilized the AlexNet architecture pretrained on ImageNet, a large collection of non-biomedical, natural images with 1000 classes, as a feature extractor for classification and patch-based segmentation tasks on brain and colorectal HE datasets. To address the common lack of annotated training datasets for these DL methodologies, the authors took this approach to demonstrate the potential of CNNs pretrained on non-biomedical images as feature extractors in biomedical applications. For the classification task, CNN-extracted patch representations for each WSI were pooled and condensed by feature selection methods before input into an SVM classifier to generate a WSI prediction. In colorectal cancer classification, the network was trained for a binary classification task to recognize tumor vs normal, and a multiclass classification task to distinguish between adenocarcinoma, mucinous carcinoma, serrated carcinoma, papillary carcinoma, cribriform comedo-type adenocarcinoma, and normal. The SVM with these CNN features as inputs outperformed the SVM with a set of manually extracted features as inputs in both classification tasks, achieving a 98.0% accuracy in binary and 87.2% in multiclass classification compared to 90.1% and 75.75%, respectively. The segmentation task involved no feature pooling, as an SVM was trained to utilize patch-level CNN features to generate a patch classification prediction. By utilizing overlapping patches, pixel-level class predictions can be designated based on an ensemble method aggregating overlapping patch predictions. Again, the SVM with CNN feature inputs outperformed the SVM with manually extracted feature inputs, showing an overall accuracy of 93.2% compared to 77.0%.
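The idea of deriving pixel-level predictions from overlapping patches can be sketched as below. The source does not specify the ensemble rule, so simple averaging of every patch probability that covers a pixel is an assumption here, as are the function name and the input layout.

```python
def pixel_probabilities(height, width, patch_preds, patch):
    """Average the predictions of all patches covering each pixel.

    patch_preds: list of ((x, y), probability) pairs, where (x, y) is the
    top-left corner of a square patch of side `patch`.
    Pixels covered by no patch are returned as None.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for (x, y), prob in patch_preds:
        for row in range(y, min(y + patch, height)):
            for col in range(x, min(x + patch, width)):
                total[row][col] += prob
                count[row][col] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else None for c in range(width)]
        for r in range(height)
    ]


# Two 2 x 2 patches overlapping in the middle column of a 2 x 3 image.
preds = [((0, 0), 1.0), ((1, 0), 0.0)]
print(pixel_probabilities(2, 3, preds, patch=2))
# [[1.0, 0.5, 0.0], [1.0, 0.5, 0.0]]
```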
Although the above studies demonstrate the value of patch-level classifications in determining a WSI-level prediction, the annotations required for such a training dataset are highly time-consuming. Additionally, clinically archived tissue specimens are typically accompanied only by the WSI or patient-level diagnosis. MIL encompasses approaches to obtain insight into patches or patch-level features most critical for designation of the WSI-level label. MIL thus represents a possible way to generate effective patch classifier models utilizing only WSI-level annotations.
In MIL, each WSI is considered a bag in which multiple instances, or patches, are contained. If any one of these patches is positive for cancer presence, then the WSI can be determined to be cancer positive. While the instance-level patches have their own classes, these are unprovided or unknown. As such, the goal of MIL is to train an instance-level classifier based on the WSI-level labels to determine these unknown patch labels.
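The bag-level rule described above can be written down directly. The hard rule follows the text; the noisy-OR function is one common smooth relaxation that lets bag-level error be traced back to instance probabilities, and it is included here as an assumption rather than as the method of any particular study.

```python
def bag_label(instance_labels):
    """A bag (WSI) is positive if at least one instance (patch) is positive."""
    return int(any(instance_labels))


def noisy_or(instance_probs):
    """Smooth bag probability: 1 minus the chance that every instance is negative."""
    p_all_negative = 1.0
    for p in instance_probs:
        p_all_negative *= 1.0 - p
    return 1.0 - p_all_negative


print(bag_label([0, 0, 1]))   # 1
print(bag_label([0, 0, 0]))   # 0
print(noisy_or([0.5, 0.5]))   # 0.75
```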
Xu et al applied the MIL-Boost algorithm for HE colorectal slides for binary cancer vs non-cancer classification. In brief, the MIL-Boost algorithm trains the instance-level classifier by “boosting”. “Boosting” refers to the successive training of weak classifiers, where each classifier improves by adding weights to incorrect predictions made by the previous classifier. Here, weak classifier weights are iteratively updated by gradient descent on the bag-level classifier loss function. Backpropagation occurs along patch instances that most negatively affected the predicted bag-level cancer positivity relative to the true WSI-level label and adjusts algorithmic weights to reduce these errors on the next iteration of the weak, instance level classifiers. This process with weak classifiers is repeated until the loss function is minimized and an effective instance-level classifier is developed. The authors demonstrated superior performance of this approach (96.30% accuracy) as opposed to a fully supervised, patch-annotated approach (95.40%) in the binary cancer vs normal classification task.
As patch instances make up a WSI bag, a MIL-type bag representation can be considered to be a collection of patch instance representations where positive instances are provided a higher weight. To this end, Ilse et al utilized a CNN to extract patch feature representations, then incorporated an attention mechanism to output a weighted average of all instances in a bag. Notably, the attention mechanism weights are determined by a two-layer neural network, meaning they are trainable unlike conventional MIL pooling operators that calculate maxes and means. These weighted bag representations can also be used to identify the most important instances for the bag prediction. The authors utilized a published HE colorectal cancer dataset with annotated nuclear patches for epithelial, inflammatory, fibroblast, and miscellaneous and formed the MIL problem so that a bag is considered positive if at least one epithelium-positive patch exists in a WSI. This MIL approach trained an epithelial patch classifier with an accuracy and F1 score of approximately 90% and AUC of 96.8%. Furthermore, the authors could threshold for only instances with high weights, leading to the visualization of epithelial regions in the original HE WSI. Although the focus of this paper was on superior performance of the neural network tunable attention-mechanism relative to fixed alternatives, the final performance metrics lend support to the capacity of MIL approaches in training patch-level classifiers from WSI-level annotations.
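A minimal sketch of attention-based MIL pooling follows, assuming the common formulation in which each instance score is w · tanh(V h) and the scores are normalized with a softmax. The parameter names `V` and `w`, the toy dimensions, and the hand-set weights are illustrative assumptions; in a real model both would be learned.

```python
import math


def attention_pool(instances, V, w):
    """Return (attention weights, weighted-average bag representation).

    instances: list of patch feature vectors h_k
    V: hidden-layer weight matrix, w: output-layer weight vector
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(w_i * u_i for w_i, u_i in zip(w, hidden)))
    # Softmax over instances, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return weights, bag


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With a zero output vector every score is equal, so pooling reduces to the mean.
uniform_weights, mean_bag = attention_pool(patches, identity, [0.0, 0.0])

# With non-zero weights the third patch scores highest and dominates the bag.
weights, bag = attention_pool(patches, identity, [1.0, 1.0])
print(weights)
```

Thresholding `weights` is what allows the highest-weighted patches to be mapped back onto the slide for visualization.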
Another major part of the colorectal cancer field is the histopathological evaluation of HE-stained polyps to determine cancerous potential. In 2017, Korbar et al[30,31] trained a ResNet-based network to distinguish between hyperplastic polyps, sessile serrated polyps, traditional serrated adenoma, tubular adenoma, and tubulovillous/villous adenoma. The authors trained a patch-based classifier, then designated WSI-level predictions according to the patch-level class prediction that was most prevalent in the sample, given that at least 5 patches outputted that prediction. This model achieved a 93.0% overall accuracy [95% confidence interval (CI): 89.0-95.9]. In another study, the authors utilized the same network architecture to identify the 5 classes but focused on implementing Gradient-weighted Class Activation Mapping (Grad-CAM) approaches to address model interpretability. Grad-CAM can backpropagate from a patch’s predicted class label to identify the regions in the input image that contributed most to the prediction. Though this was an early approach, the study showed promising potential for these Grad-CAM approaches to help identify regions of interest (ROIs) that were most influential in the patch-level polyp classification.
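The slide-level voting rule can be sketched as below. The minimum of 5 supporting patches comes from the text; returning `None` when no class reaches that support is an assumption, since the source does not say how such slides were handled, and the class abbreviations are hypothetical.

```python
from collections import Counter


def slide_prediction(patch_labels, min_patches=5):
    """Return the most prevalent patch class if at least `min_patches` support it."""
    counts = Counter(patch_labels)
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    return label if n >= min_patches else None


print(slide_prediction(["TA"] * 6 + ["HP"] * 3))  # TA
print(slide_prediction(["TA"] * 3 + ["HP"] * 2))  # None
```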
Esophageal: While different from Grad-CAM, Tomita et al utilized a related concept in implementing attention-based mechanisms for weakly supervised training to detect 4 classes—normal, Barrett’s esophagus without dysplasia, Barrett’s esophagus with dysplasia, and esophageal adenocarcinoma—from HE-stained esophageal and gastroesophageal junction biopsies. The approach involved breaking up a WSI into patches, from which a CNN would extract features. Each WSI could then be represented as a feature map that is an aggregated patch grid of extracted feature vectors. These feature maps serve as inputs to the attention-based model, the goal of which is to identify the regions of the input feature maps most important for the output label classifications. Therefore, a concept is shared with Grad-CAM in identifying input image regions most influential to the class predictions. Unlike Grad-CAM, the attention-based model will learn to add weights to influential areas in the feature map to aid in final model classification performance. Of note, this process is considered weakly supervised because image output labels are only provided at the WSI-level, as opposed to the patch-level, yet the most influential patch types can be distinguished. The model manages to learn on its own the most salient image features and regions that were most important for the WSI label. The approach here achieved an overall accuracy of 83.0% (95%CI: 80-86) in identifying the 4 classes, outperforming the supervised baseline with an overall accuracy of 76% (95%CI: 73-80) that depends upon extraction of patches from ROI tediously annotated by pathologists. It should be noted, however, that the model achieves an F1 score of 0.59 (95%CI: 0.52-0.66) and the supervised baseline an F1 score of 0.50 (95%CI: 0.43-0.56) possibly indicating a high rate of false positives and negatives.
Moving even further away from supervised learning, Sali et al demonstrated superior performance of unsupervised approaches in classifying HE-stained WSIs to be dysplastic Barrett’s esophagus, non-dysplastic Barrett’s esophagus, and squamous tissue relative to supervised methods. The supervised approach was analogous to Iizuka et al. Training patches labeled by pathologists were used to train the model, then an SVM or RF classifier aggregated the patch-level information for the WSI-level prediction.
The unsupervised feature extraction approach involved a deep convolutional autoencoder (CAE). Deep CAEs are broken up into an encoder and decoder branch. The encoder branch typically applies a series of convolution and pooling operations to act as a feature extractor that outputs a bottleneck feature vector. The decoder branch upsamples back from the bottleneck feature vector and reproduces the original image. Here, the loss function minimizes the differences between the input image and reproduced version, thereby enforcing that the bottleneck feature vector is an effective representation of the input. A helpful analogy is when one zips files on the computer. The process compresses the original file to a smaller memory size (encoding), but then still allows one to re-generate the full-size, original file (decoding). As one knows the zipping mechanism works, he or she can confidently share zipped file versions to others, instead of the larger, original file.
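The reconstruction objective can be written compactly. The source does not name the specific loss, so the mean squared error below is an assumption; the encoder E with parameters θ, the decoder D with parameters φ, and the N training patches x_i are notation introduced only for this illustration.

```latex
\mathcal{L}(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N}
  \left\lVert x_i - D_{\phi}\!\left(E_{\theta}(x_i)\right) \right\rVert_2^2
```

Minimizing this quantity forces the bottleneck vector E_θ(x_i) to retain enough information to rebuild the patch, which is what makes it usable as a feature representation.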
Once the deep CAE is trained, it can be utilized as a feature extractor for all patches in one’s training dataset. Then, by applying clustering approaches, such as k-means or Gaussian mixture models (GMM), to all feature vector-transformed patches, patch types or classes across the dataset can be defined. An SVM or RF classifier can be trained to predict the WSI class by using the relative proportions of the different patch class types in the sample. For WSI-level inference, the deep CAE extracts feature vectors from all patches in the WSI, bins and counts the number of patches per clustering-defined patch type, then utilizes the trained SVM or RF classifier to generate the WSI prediction. This process is called unsupervised because the different types of patches in the WSI are determined by the algorithm independent of any labelling. This is in contrast to the supervised approach, where a CNN was trained to classify between human-defined dysplastic Barrett’s esophagus, non-dysplastic Barrett’s esophagus, and squamous tissue patch types. The unsupervised GMM method showed good performance with weighted averages for accuracy, AUC, F1, precision, and recall all above 90%. In contrast, the metrics for the supervised approaches ranged from 50%-80%.
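The binning step that turns a slide into a fixed-length input for the SVM or RF classifier can be sketched as follows. Cluster assignments are taken as given here; the function name and the zero vector returned for an empty slide are assumptions for the example.

```python
def patch_type_proportions(cluster_ids, n_clusters):
    """Convert per-patch cluster assignments into relative proportions per patch type."""
    counts = [0] * n_clusters
    for cluster in cluster_ids:
        counts[cluster] += 1
    total = len(cluster_ids)
    if total == 0:
        return [0.0] * n_clusters
    return [c / total for c in counts]


# A hypothetical slide with four patches assigned to three discovered patch types.
print(patch_type_proportions([0, 0, 1, 2], n_clusters=3))  # [0.5, 0.25, 0.25]
```

Because the output length depends only on the number of clusters, slides with very different patch counts map to comparable feature vectors.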
Gastric: Though gastric pathologies and cancers will be covered further in other sections of this review, relatively little literature exists regarding classification of gastric cancers alone. Leon et al demonstrated that, in gastric cancer classification, inputting whole image patches into a custom Keras sequential model outperforms using nuclei extracted from these image patches as input. This may be explained by the fact that the whole image patch contains morphological features that might be important for classification, while the cell input approach sacrifices those portions of the image. The other major study to note is the one by Iizuka et al mentioned earlier, which showed impressive performance in classifying gastric and colorectal adenomas and adenocarcinomas.
Liver: As in the mentioned studies by Iizuka et al and Ilse et al, CNNs can be used to extract patch feature representations. These representations are 1-dimensional vectors comprised of numerical, float values, and higher values can be interpreted as features most important, or highly activated nodes, for the prediction at hand, while lower values may be interpreted as important for the other non-predicted class.
Since these feature values can be reflective of their relative importance in the predicted class, Sun et al used a CNN to extract patch representations from HE-stained WSIs, performed a pooling operation to aggregate patch features at the image level, then sorted the representations to organize activation values from high to low importance in terms of liver cancer prediction. The authors selected a range of top-k and bottom-k features from this sorted list to use in patch representations, driven by the idea that high activations should indicate features important for cancer classifications, while the lower activations should correspond to normal. The variable length representations dependent on k were tested to generate condensed patch representations in training a binary cancer vs normal classifier. A value of 100 for k was deemed optimal, and the authors used the patch classifications to predict WSI cancer vs normal status. The approach achieved an accuracy of 98%, a recall of 1.0, and an F1 score of 0.99.
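The top-k and bottom-k selection can be sketched as below. This is an illustration of the selection step only, assuming the pooled representation is a flat list of activation values; the function name and the error raised for an invalid k are hypothetical.

```python
def condensed_representation(features, k):
    """Keep the k highest and k lowest activations from a pooled feature vector."""
    if k < 1 or 2 * k > len(features):
        raise ValueError("k must satisfy 1 <= k <= len(features) / 2")
    ranked = sorted(features, reverse=True)  # high to low activation
    return ranked[:k] + ranked[-k:]


pooled = [0.1, 0.9, 0.5, 0.3, 0.7]
print(condensed_representation(pooled, k=2))  # [0.9, 0.7, 0.3, 0.1]
```

The output length is always 2k, which is why the representation length varies with the value of k being tested.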
In addition to the design of effective image classification algorithms, the incorporation of these methodologies into clinical workflow is important to consider. Kiani et al trained a DenseNet CNN to classify between hepatocellular carcinoma and cholangiocarcinoma from HE image patches and developed a diagnostic support tool that outputs predicted classes with probabilities and class activation maps (CAMs) to highlight areas of the input patch important for the prediction. The effects of the diagnostic support tool were analyzed and revealed that, while correct classifier predictions significantly improved accuracy, incorrect classifier predictions significantly decreased accuracy of diagnosing pathologists. Thus, this study highlights the important notion that the damaging effects of incorrect and misleading classifiers need to be strongly considered before clinical implementations.
Pancreatic neuroendocrine: IHC stains are another common technique applied to histopathological samples. The ability to detect specific antigens can be important for the characterization of certain cancer types. In pancreatic neoplasms, for example, the Ki67 stain is used to define proliferative rate and assign grades to pancreatic neuroendocrine tumors (NETs). However, this process is complicated by Ki67 stain positivity in both tumor and non-tumor regions. To address this issue, Niazi et al trained an Inception-v3 network pretrained on ImageNet in a transfer learning setting to detect tumor and non-tumor regions on Ki67-stained pancreatic NET WSIs. As with Xu et al, the concept is that learned features from training on ImageNet should be beneficial within the biomedical setting. By freezing weights on all layers except for the final classification layer, the authors ensured that the feature extraction portion of the network remains unchanged. Training thus affects only the manner in which the classification layer utilizes patch representations instead of affecting the feature extraction itself. The trained model was used to create probability maps for tumor and non-tumor predictions for every pixel in the WSI, then thresholded by 0.5 to generate masks for each class. As each pixel in the image was then assigned to its most probable class, the output generated a segmentation map-type output that is shaded by predicted classes. In identifying tumor and non-tumor regions on a Ki67-stained IHC slide, the model showed about 96%-99% overall accuracy with 97.8% sensitivity and 88.8% specificity.
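The thresholding step that converts a probability map into a class mask can be sketched as follows. The 0.5 cutoff comes from the text; treating a probability exactly equal to the threshold as positive is an assumption, and the nested-list map is a stand-in for a real image array.

```python
def threshold_mask(prob_map, threshold=0.5):
    """Binarize a 2D probability map: 1 where probability >= threshold, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


tumor_probs = [[0.2, 0.5],
               [0.9, 0.49]]
print(threshold_mask(tumor_probs))  # [[0, 1], [1, 0]]
```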
Cancer lymphocyte interactions: In addition to the cancer itself, other cell types exist within the microenvironment. To address this, Saltz et al trained a VGG-16 network to identify tumor-infiltrating lymphocyte (TIL) containing patches across 13 The Cancer Genome Atlas (TCGA) HE-stained tumor types. The study identified four types of TIL infiltration patterns: Brisk Diffuse, Brisk Band-like, Non-Brisk Multifocal, and Non-Brisk Focal. The study also found associations between TIL infiltration patterns, cancer type, inflammatory response subtype, and molecular cancer subtypes and supports the notion that spatial phenotypes have the exciting potential to correlate with molecular findings.
Cancer nuclei classification: Another avenue of classification tasks in cancer applications has been the study of nuclei. Pathologists are able to utilize visual, nuclear information, such as aberrant chromatin structures, to identify cancerous cells. Thus, groups have worked on replicating this task of nuclei classification.
Chang et al extracted HE-stained nuclei, used immunofluorescence (IF) pan-cytokeratin (panCK) stains aligned to the HE slide by image registration methods to label the HE-extracted nuclei as cancerous or non-cancerous, then trained a CNN to make these distinctions from just an HE input. The panCK-defined cancer positivity approach eliminated the need for tedious, pathologist annotations on the HE images and achieved a 91.3% accuracy with 89.9% sensitivity, 92.8% specificity, and 92.6% precision in classifying cancerous vs non-cancerous nuclei on the independent test set.
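For reference, the four metrics reported here follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the counts in the usage example are made up for illustration and are not taken from any study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # recall on the positive (cancerous) class
        "specificity": tn / (tn + fp),  # recall on the negative class
        "precision": tp / (tp + fp),    # fraction of positive calls that are correct
    }


metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```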
Sirinukunwattana et al implemented a spatially constrained CNN to identify pixels most likely to represent the center of nuclei, then trained a subsequent CNN classifier to predict whether the nuclei came from an epithelial, inflammatory, fibroblast, or miscellaneous cell in colon cancer. The authors also implemented a Neighboring Ensemble Predictor in the nuclei classifications, which, when predicting the class of a nucleus, incorporated the predictions from all neighboring patches. This approach achieved a weighted average F1 score of 0.784 and AUC of 0.917 in the nucleus classification tasks and a weighted average F1 score of 0.692 in the combined nucleus detection and classification tasks. In a follow-up study, Shapcott et al utilized this nuclei classification algorithm to quantify the four cell types and correlate cellular proportions with different clinical variables in TCGA colorectal cancer patients. This led to findings such as samples with metastasis having more fibroblasts with fewer epithelial and inflammatory cells, samples with residual tumor having more fibroblasts and fewer epithelial and inflammatory cells, and that both venous and vascular invasion were associated with more fibroblasts.
Though much focus in image classification has been in cancers, other image classification applications exist and are highlighted here.
Celiac disease, environmental enteropathy, and nonspecific duodenitis: Wei et al trained a ResNet-based model to classify between celiac disease, normal tissue, and nonspecific duodenitis on HE-stained WSIs with accuracies of 95.3%, 91.0%, and 89.2%, respectively. This was a supervised, patch-based approach for training, and WSIs were predicted to be nonspecific duodenitis if more than 5 patches were classified as such or predicted to be the dominant patch class otherwise.
In a similar supervised fashion, Srivastava et al trained a ResNet model on duodenal HE biopsies to classify between celiac disease, environmental enteropathy, and normal tissue. Patch classifications were aggregated for the WSI prediction and returned an overall 97.6% accuracy.
Sali et al also trained a ResNet model, but for the task of Marsh Score-based grading of celiac disease severity using HE-stained duodenal biopsies. The authors utilized a CAE to generate patch representations, then performed a 2-class k-means clustering to filter out uninformative patches containing no tissue. The tissue-containing patches were then used for supervised training of the ResNet model to distinguish between Marsh scores of I, IIIa, IIIb, and IIIc. Again, patch predictions were aggregated for a WSI-level prediction. The model showed an accuracy and F1 score of around 80%-90% for all classes and also implemented CAM approaches to localize certain cell subsets contributing to some of these Marsh Score categories.
In another study, Sali et al took a novel, hierarchical approach towards training a VGG classifier to detect 7 classes: Duodenum-celiac disease, Duodenum-Environmental enteropathy, Duodenum-normal, Ileum-Crohn’s, Ileum-normal, Esophagus-eosinophilic esophagitis, and Esophagus-normal. In addition to having the classifier predict the disease type with the final classification layer, the approach incorporated another output branch in the VGG network to predict anatomic location. The loss function combined outputs of the two branches and forced the network to learn both anatomic origin and specific disease type. Additionally, the anatomic origin branch occurs before the final classification layer, meaning that the network needs to correctly determine the anatomic origin first, before homing in on the specific diagnosis. Across all 7 classes, the model exhibited F1 scores ranging from 0.714 for Duodenum-normal to 0.950 for Duodenum-Environmental enteropathy.
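A two-branch loss of this kind can be sketched as the sum of one cross-entropy term per branch. The source states only that the loss combined the two outputs, so the additive form, the `site_weight` parameter, and the function names below are assumptions made for illustration.

```python
import math


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])


def hierarchical_loss(site_probs, site_true, disease_probs, disease_true, site_weight=1.0):
    """Combine the anatomic-site branch and the disease branch into one loss value."""
    site_term = cross_entropy(site_probs, site_true)
    disease_term = cross_entropy(disease_probs, disease_true)
    return site_weight * site_term + disease_term


# Three anatomic sites (duodenum, ileum, esophagus) and seven disease classes.
site_probs = [0.5, 0.5, 0.0]
disease_probs = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
loss = hierarchical_loss(site_probs, 0, disease_probs, 1)
print(loss)  # two terms of ln(2) each
```

Because both terms contribute to the gradient, a prediction with the correct disease but the wrong anatomic site is still penalized.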
Helicobacter pylori gastritis and reactive gastropathy: Similar to the other examples, these represent diagnoses that can be made from HE-stained specimens. Martin et al trained the commercially available HALO-AI CNN to classify between Helicobacter pylori, reactive gastropathy, and normal in gastric biopsies. The model achieved sensitivity/specificity pairings of 73.7%/79.6%, 95.7%/100%, 100%/62.5% for normal, Helicobacter pylori, and reactive gastropathy, respectively.
Klein et al developed a model that combines image processing techniques with DL. The authors utilized image processing techniques on both Giemsa- and HE-stained slides to identify potential Helicobacter pylori regions, then had experts review these as being positive or negative for Helicobacter pylori presence. These could then be utilized as input-output pairs to train a VGG-style network. The main goal of this paper, however, was to create a clinical decision support system that utilized the trained model and directs pathologists to Helicobacter pylori hotspots using Grad-CAM-style methodologies. Although this clinical decision support approach showed higher sensitivity than just microscopic diagnosis (100% vs 68.4%), specificity was lower than with just microscopic diagnosis (66.2% vs 92.6%).
Segmentation generally refers to operations that localize and detect cells and structures within a WSI. As pathologists can detect these objects within a sample, the goal is to train models to replicate these tasks.
The gland segmentation in colon histology images challenge contest: A key contributor to the progression of computer vision disciplines has been the presence of challenges that provide a dataset and rank submitted models based on performance-related quantitative metrics such as F1 scores or AUC values. One example of this is the gland segmentation in colon histology images challenge contest (GlaS) that was held in 2015. These challenges help to stimulate computational disciplines. For one, the announcement of the challenge itself encourages researchers worldwide to tackle the problem. Compared to standalone papers, these challenges also have the advantage of pitting the best models against each other to generate a clear benchmark for state-of-the-art performance.
Furthermore, even after completion of the challenge, groups will continue to optimize their algorithms and will have the ability to compare performance to previous high rankers in the challenge. Even since the GlaS challenge, numerous groups have continued to work on gland segmentation models by incorporating novel mechanisms. In 2016, Xu et al added multichannel feature extractions for region and edge probability maps that were then fed into the final CNN for instance segmentation. Also in 2016, BenTaieb et al applied topological and geometric loss functions into their FCN-based model. In 2019, Graham et al introduced a new network component, the minimal information loss unit, that re-introduces resized versions of the original input image to combat the loss of information that accompanies downsampling from the successive convolution and max-pooling operations that occur in neural networks. Most recently in 2020, Zhao et al incorporated spatial attention to weight important spatial locations and channel attention to weight important features to improve gland segmentation performance.
Non-colon gland segmentations: In general, segmentation methodologies require an additional step of development compared to classification tasks. For example, identifying glands in colonic mucosa is an important task but needs additional interpretation to be useful in the clinic. Some possibilities include quantifying the total number of glands or extracting shape-based glandular information to feed into a colorectal cancer classifier. Classification tasks like “Tumor” vs “Healthy”, on the other hand, often already have a clear path towards clinical integration within the pathologist diagnostic workflow.
Reflective of this, many histopathological segmentation studies in gastroenterology and hepatology tend to be focused on optimizing segmentation results themselves, as opposed to continuing onto the translational application. However, high performance segmentations are critical in developing the downstream, clinically impactful algorithms. While some studies have continued onto the next step, the next few years will likely see some more of these segmentation studies bridging into more translational studies.
To highlight some examples, Xiao et al segmented liver portal area components for eventual hepatitis grading. Extraction of features from these segmented structures to train a classifier to grade hepatitis will likely be the next step of this process. Xu et al used a patch-based segmentation approach to identify epithelial and stromal regions in HE-stained breast and epidermal growth factor receptor-stained colon cancer slides, as tumor-stroma ratios are recognized to have prognostic value. Here, the next step would be to assess the impact of algorithm-derived epithelium and stroma ratios in patient prognosis or cancer classification. Similarly, to address the eventual use case of segmenting tumors to assess pre-surgical tumor burden, Wang et al used multitask and ensemble learning techniques for pixel-wise HE hepatocellular carcinoma segmentation. For eventual use in computer-assisted diagnosis systems, Qaiser et al developed a fast HE colorectal segmentation algorithm that defines persistent homology profiles to capture morphological differences between normal and cancer nuclei. The emergence of more directly translational follow-up studies and validations should be exciting and will be important to monitor.
Moving downstream with segmentation outputs: Some studies have entered this second phase and will be highlighted in this section. Awan et al utilized a modified version of U-Net to perform colon gland segmentation on HE-stained colorectal adenocarcinoma patches, then extracted quantitative measures of glandular aberrance to train a SVM classifier for normal vs tumor classification and for normal vs low grade vs high grade classification. Glandular aberrance correlated with tumor grade, and this method achieved an accuracy of 97% and 91% for the two-class and three-class classifications, respectively. Thus, application of segmentation outputs in this manner can allow for the definition and extraction of novel quantitative features to aid in classification tasks and may provide a look into how these segmentation algorithms will be clinically implemented in the future.
Multiplex IHC (mIHC) involves concurrent histological staining of 6 or more cell markers, and Abousamra et al developed an autoencoder-based color deconvolution algorithm to segment these different stains within a WSI. In a follow-up study, Fassler et al utilized this algorithm on mIHC-stained pancreatic ductal adenocarcinoma (PDAC) WSIs to detect and perform spatial analyses on the cell types. Results indicated that CD16+ myeloid cells dominated the immune microenvironment and on average were closer to tumor cells than CD3+, CD4+, CD8+, or CD20+ lymphocyte populations. In contrast to the study by Awan et al that used segmentation outputs to inform a clinical task, Fassler et al targeted a research application. A pipeline to detect all cell types from mIHC-stained WSIs, quantify them, and perform spatial statistics would serve a wide audience of basic and translational researchers and, in elevating analytical capacities, may stimulate research output.
A popular translational application of segmentation outputs has been in the field of hepatic steatosis quantification, which is important in the assessment of patients with fatty liver disease or to assess donor liver-quality for transplantation. In an earlier study, Lee et al demonstrated correlation of steatosis quantification by image processing methods on WSIs with MRI measurements, pathologist visual scoring, and several clinical parameters, serving to validate the potential of image feature extraction from WSIs for these applications.
Forlano et al took an ML-based approach to quantify the four histological features used in the Nonalcoholic Steatohepatitis Clinical Research Network (NASH CRN) Scoring System, in an effort to automate the process and assess how their computationally extracted, quantitative histological metrics correlate with the semi-quantitative, categorical metrics of the NASH CRN Scoring System. The authors used image processing techniques to segment and calculate percentages of fat, inflammation, ballooning, and collagen proportionate area, then fed the values into a binary logistic regression classifier to predict the presence of NASH. The authors argued that the traditional, semi-quantitative approaches are outdated, due to their categorical nature and unavoidable interobserver variability, and demonstrated an AUC of 0.802 for their pipeline’s capacity to predict NASH.
Sun et al took a modified VGG-16 patch-based segmentation approach to quantify macrovesicular steatosis in HE-stained frozen donor liver biopsies. The network was trained on patches extracted from WSIs with steatosis regions annotated by pathologists. As such, the final portion of their network could be trained against the pathologist-annotated steatosis maps to output pixel-wise steatosis prediction maps from HE patch inputs. Steatosis percent could then be calculated by summing steatosis probabilities from the prediction maps and dividing by total tissue area. Overall, the model had a sensitivity of 71.4% and specificity of 97.3% in predicting samples with over 30% steatosis, which is the threshold used by some centers for donor rejection.
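The steatosis percentage calculation can be sketched as follows, assuming a pixel-wise probability map and a binary tissue mask of the same shape. The function name, the nested-list inputs, and the toy values are hypothetical; the 30% cutoff is the donor-rejection threshold mentioned in the text.

```python
def steatosis_percent(prob_map, tissue_mask):
    """Sum steatosis probabilities over tissue pixels and divide by tissue area."""
    fat = 0.0
    area = 0
    for prob_row, mask_row in zip(prob_map, tissue_mask):
        for prob, is_tissue in zip(prob_row, mask_row):
            if is_tissue:
                fat += prob
                area += 1
    if area == 0:
        raise ValueError("tissue mask contains no tissue pixels")
    return 100.0 * fat / area


prob_map = [[1.0, 0.5],
            [0.0, 0.5]]
tissue_mask = [[1, 1],
               [1, 0]]  # the bottom-right pixel is background and is ignored
percent = steatosis_percent(prob_map, tissue_mask)
print(percent, percent > 30.0)  # 50.0 True
```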
Roy et al trained a network to segment foreground steatosis droplet pixels from background, a network to recognize steatosis droplet boundaries, and a third neural network that took both of those outputs as input to generate the final segmentation map. Their segmentation results allowed for the calculation of steatosis pixel percentage (DSP%) and steatosis droplet count percentage (DSC%). DSC% most strongly correlated with histologically determined macrovesicular steatosis percentage (rho = 0.90, P < 0.001) and total steatosis percentage (rho = 0.90, P < 0.001). DSP% showed the best correlation with MRI fat quantification (rho = 0.85, P < 0.001).
Lastly, Salvi et al developed an algorithm to quantify both micro- and macrosteatosis on HE-stained liver WSIs. The algorithm achieved an overall accuracy of 97.27% on the test set for steatosis segmentation and showed the lowest average error of 1.07% when comparing automated steatosis quantification with manual quantification methods.