Yang YJ, Bang CS. Application of artificial intelligence in gastroenterology. World J Gastroenterol 2019; 25(14): 1666-1683 [PMID: 31011253 DOI: 10.3748/wjg.v25.i14.1666]
Corresponding Author of This Article
Chang Seok Bang, MD, PhD, Assistant Professor, Doctor, Department of Internal Medicine, Hallym University College of Medicine, Sakju-ro 77, Chuncheon, Gangwon-do 24253, South Korea. firstname.lastname@example.org
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Author contributions: Yang YJ collected the data and drafted the manuscript. Bang CS conceptualized, collected the data, drafted the manuscript, performed critical revision and approved the final manuscript.
Conflict-of-interest statement: The authors declare no conflicts of interest.
Telephone: +82-33-2405821 Fax: +82-33-2418064
Received: February 9, 2019 Peer-review started: February 12, 2019 First decision: February 26, 2019 Revised: March 4, 2019 Accepted: March 16, 2019 Article in press: March 16, 2019 Published online: April 14, 2019
Artificial intelligence (AI) using deep learning (DL) has emerged as a breakthrough computer technology. In the era of big data, the accumulation of an enormous number of digital images and medical records has driven the need for AI to handle these data efficiently, and the data themselves have become fundamental resources from which a machine can learn. Among several DL models, the convolutional neural network has shown outstanding performance in image analysis. In gastroenterology, physicians handle large amounts of clinical data and various imaging modalities such as endoscopy and ultrasound. AI has been applied in gastroenterology for diagnosis, prognosis, and image analysis. However, the selection bias inherent in retrospective study designs cannot be excluded. Because overfitting and spectrum bias (class imbalance) can lead to overestimated accuracy, external validation is mandatory, using datasets that were not used for model development and that were collected in a way that minimizes spectrum bias. For robust verification, prospective studies with adequate inclusion/exclusion criteria that represent the target populations are needed. DL also inherently lacks interpretability. Because interpretability is important in that it can provide safety measures, help to detect bias, and build social acceptance, further investigations should be performed.
Core tip: Artificial intelligence (AI) using deep learning (DL) has emerged as a breakthrough computer technology. The convolutional neural network has exhibited outstanding performance in image analysis. AI has been applied in gastroenterology for diagnosis, prognosis, and image analysis. However, the inherent pitfalls of selection bias, overfitting, and spectrum bias (class imbalance) can lead to overestimated accuracy and overgeneralized results. Therefore, external validation is mandatory, using datasets that were not used for model development and that were collected in a way that minimizes spectrum bias. DL also inherently lacks interpretability, and further investigations should be performed on this issue.
Recently, artificial intelligence (AI) using deep-learning (DL) has emerged as a breakthrough computer technology, and numerous research studies, using AI applications to identify or differentiate images in various medical fields including radiology, neurology, orthopedics, pathology, ophthalmology, and gastroenterology, have been published. However, AI, the display of intelligent behavior indistinguishable from that of a human being, was already mentioned in the 1950s. Although AI has waxed and waned over the past six decades with seemingly little improvement, it was constantly applied to the medical field using various models of machine learning (ML) including Bayesian inferences, decision trees, linear discriminants, support vector machines (SVM), logistic regression, and artificial neural networks (ANNs).
In the era of big data, the accumulation of enormous numbers of digital images and medical records has driven a need for AI to handle these data efficiently, and these data have also become fundamental resources from which the machine learns by itself. Furthermore, the growth of computing power with graphics processing units has helped to overcome the limitations of traditional ML, particularly overtraining on the input data (overfitting). This led to a revival of AI, especially through DL technology, a new form of ML. Among several DL methods, the convolutional neural network (CNN), which consists of multiple ANN layers that each perform a minimal step of processing, has shown outstanding performance in image analysis and has received considerable attention (Figure 1 and Table 1).
Artificial intelligence: Machine intelligence that has cognitive functions similar to those of humans such as “learning” and “problem solving.”
Machine learning: Mathematical algorithms that are automatically built from given data (known as input training data) and that predict or make decisions in uncertain conditions without being explicitly programmed
Support vector machines: Discriminative classifier formally defined by an optimizing hyperplane with the largest functional margin
Artificial neural networks: Multilayered interconnected network that consists of an input layer, a hidden layer (between the input and output layers), and an output layer
Deep learning: Subset of machine learning techniques that is composed of multiple-layered neural network algorithms
Convolutional neural networks: Specific class of artificial neural networks that consists of (1) convolutional and pooling layers, which are the two main components to extract distinct features; and (2) fully connected layers to make an overall classification
Overfitting: Modelling error that occurs when a learning model tailors itself too closely to the training dataset, so that its predictions do not generalize well to new datasets
Spectrum bias: Systematic error that occurs when the dataset used for model development does not adequately represent the range of patients to whom the model will be applied in clinical practice (target population)
Figure 1 Schematic graphical summary for artificial intelligence, machine learning and deep learning development.
A: Definition of artificial intelligence, machine learning (ML) and deep learning (DL). B: Comparison of process between classic ML and DL. C: Modes of learning and examples of ML.
In the field of gastroenterology, physicians handle large amounts of clinical data and various kinds of imaging devices such as esophagogastroduodenoscopy (EGD), colonoscopy, capsule endoscopy (CE), and ultrasound equipment. AI has been applied in gastroenterology to make diagnoses, predict prognoses, and analyze images. Previous studies reported remarkable results of AI in gastroenterology. The rapid progression of AI demands that gastroenterologists learn the utility, strengths, and pitfalls of AI. In addition, physicians should prepare for the changes and effects of AI on real clinical practice in the near future. Hence, in this review, we aim to (1) briefly introduce ML technology; (2) summarize AI applications in gastroenterology, divided into two categories (statistical analysis for the recognition of diagnosis or prediction of prognosis, and image analysis), limited to applications in patients and excluding animal studies; and (3) discuss the challenges for the application and future directions of AI.
Generally, AI is considered as a machine intelligence that has cognitive functions similar to those of humans including “learning” and “problem solving”. Currently, ML is the most common approach of AI. It automatically builds mathematical algorithms from given data (known as input training data) and predicts or makes decisions in uncertain conditions without human instructions (Figure 1A). In the medical field, ML methods such as Bayesian networks, linear discriminants, SVMs, and ANNs have been used. A naïve Bayes classifier that represents the probabilistic relationship between input and output data is a typical classification model. The SVM, which was invented by Vladimir N Vapnik and Alexey Ya Chervonenkis in 1963, is a discriminative model that uses a dividing hyperplane. Before DL development, SVM showed the best performance for classification and regression, which were achieved by optimizing a hyperplane with the largest functional margin (distance from the hyperplane in a high- or infinite-dimensional space to the nearest training data point of any class).
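The margin described above (the distance from the hyperplane to the nearest training point) can be illustrated with a minimal Python sketch. The weight vectors, biases, and data points below are hypothetical values chosen for illustration; they are not taken from any cited study, and no SVM is actually trained here.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-weighted distance over all training points.

    A positive value means every point lies on its correct side.
    An SVM searches for the (w, b) that makes this value as large as possible.
    """
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Hypothetical two-feature data set; each label is +1 or -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two hand-picked candidate hyperplanes; both separate the classes.
margin_a = margin((1.0, 1.0), -2.0, points, labels)  # diagonal boundary
margin_b = margin((1.0, 0.0), -1.0, points, labels)  # vertical boundary
```

Both candidates classify every point correctly, but the first leaves a wider gap to the nearest point, so a maximum-margin classifier would prefer it.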
An ANN is a multilayered interconnected network inspired by the neuronal connections of the human brain. The ANN was introduced by McCulloch and Pitts in 1943 and was further developed in 1957 by Frank Rosenblatt using the concept of the perceptron. As a hierarchical structure, the ANN consists of an input layer, a hidden layer (between the input and output layers), and an output layer. Each connection in the hidden layer has a strength (known as a weight) that is used in the learning process of the network (Figure 1B). Through an appropriate training (learning) process, the network adjusts the values of the connection weights to optimize its output (Figure 1C).
In the 1980s, an ANN with several hidden layers between the input and output layers was introduced. This became known as DL (or a deep neural network). Although the ANN showed remarkable performance in managing nonlinear datasets for diagnosis and prognostic prediction in the medical field, it also revealed several weaknesses: vanishing gradients, overfitting, insufficient computing capacity, and a lack of training data. These weaknesses hampered the advancement of the ANN. Eventually, the availability of big data provided sufficient input data for training, and the rapid progression of computing power allowed researchers to overcome these limitations. Among several AI methods, DL has received public attention and has shown excellent performance in computer vision using CNNs.
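The weight-adjustment idea can be sketched with the classic perceptron learning rule. This is a minimal illustration under stated assumptions: the toy task (logical AND), the learning rate, and the number of epochs are hypothetical choices, and a single threshold unit is far simpler than the multilayer networks used in the studies reviewed here.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, targets, lr=1.0, epochs=20):
    """Perceptron rule: shift each weight by lr * error * input."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            error = t - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy, linearly separable task (logical AND); purely illustrative.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, targets)
outputs = [predict(weights, bias, x) for x in samples]
```

Weights change only when a prediction is wrong, so training settles once every sample is classified correctly.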
A CNN consists of (1) convolutional and pooling layers, which are the two main components to extract distinct features; and (2) fully connected layers to make an overall classification. The input images are passed through numerous specific filters to extract specialized features and to create multiple feature maps. This filtering operation is called convolution. Learning the convolution filters that produce the best feature maps is essential for success in a CNN. These feature maps are compressed to smaller sizes by pooling the pixels to capture a larger field of the image, and these convolutional and pooling layers are iterated many times. Finally, fully connected layers combine all features and produce the final outcomes (Figure 1B).
The rapid growth of the CNN was demonstrated at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by Geoffrey Hinton and colleagues, and several CNNs such as Inception from Google and ResNet from Microsoft have since shown excellent performance. A graphical summary of AI, ML, and DL development is shown in Figure 1.
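The two feature-extraction steps can be sketched in plain Python. This is an illustration only: the 5 × 5 image and the edge filter are hypothetical, and the filter is set by hand here, whereas a CNN learns its filter values during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical grayscale image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-set filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, kernel)  # 4 x 4 map, peaks at the edge
pooled = max_pool(feature_map)           # 2 x 2 summary of the map
```

The feature map is nonzero only where the edge sits, and pooling keeps that response while halving each dimension.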
APPLICATION OF AI IN GASTROENTEROLOGY
Recognition of diagnosis and prediction of prognosis
Although AI in gastroenterology has recently focused on image analysis, several ML models have shown promising results in the recognition of diagnosis and prediction of prognosis. The ANN is appropriate for dealing with complex datasets and can overcome the drawbacks of traditional linear statistics. In addition, the ANN can represent the sophisticated interactions between demographic, environmental, and clinical characteristics.
In terms of diagnosis, Pace et al demonstrated an ANN model in 2005 that diagnosed gastroesophageal reflux disease using only 45 clinical variables in 159 cases with an accuracy of 100%. Lahner et al performed a similar pilot study to recognize atrophic gastritis solely from the clinical and biochemical variables of 350 outpatients, using ANNs and linear discriminant analysis. This study showed great accuracy.
Regarding the prediction of prognosis, in 1998, Pofahl et al compared an ANN model with the Ranson criteria and the Acute Physiology and Chronic Health Evaluation II (APACHE II) scoring system for predicting the length of stay of patients with acute pancreatitis. The authors used a backpropagation neural network that was trained using 156 patients. Although the highest specificity (94%) was observed with the Ranson criteria, the ANN model showed the highest sensitivity (75%) when predicting a length of stay of more than 7 d. Similar accuracy was observed for the Ranson criteria and APACHE II scoring system. In 2003, Das et al used an ANN to predict the outcomes of acute lower gastrointestinal bleeding using 190 patients. The authors compared the performance of the ANN with that of a previously validated scoring system (BLEED); the ANN model showed significantly better predictive accuracy for mortality (87% vs 21%), recurrent bleeding (89% vs 41%), and the need for therapeutic intervention (96% vs 46%).
Sato et al presented an ANN model in 2005 to predict 1-year and 5-year survival using 418 esophageal cancer patients. This ANN model showed improved accuracy compared to the conventional linear discriminant analysis model.
Recently, the number of patients used as input training data for ANNs has increased from hundreds to thousands. Rotondano et al compared the Rockall score with a supervised ANN model for predicting mortality in nonvariceal upper gastrointestinal bleeding using 2380 patients. Compared with the complete Rockall score, the ANN model showed superior sensitivity (83.8% vs 71.4%), specificity (97.5% vs 52.0%), accuracy (96.8% vs 52.9%), and area under the receiver operating characteristic curve (AUROC) (0.95 vs 0.67).
Takayama et al established an ANN model for the prediction of prognosis in patients with ulcerative colitis after cytoapheresis therapy and achieved a sensitivity and specificity for the need of an operation of 96% and 97%, respectively. Hardalaç et al established an ANN model to predict mucosal healing by azathioprine therapy in patients with inflammatory bowel disease (IBD) and achieved 79.1% correct classifications. Peng et al used an ANN model to predict the frequency of the onset, relapse, and severity of IBD. The researchers achieved only average accuracy in predicting the frequency of onset and the severity of IBD but high accuracy in predicting the frequency of relapse (mean square error = 0.009, mean absolute percentage error = 17.1%).
SVMs have been used to analyze data and recognize patterns in classification analyses. Recently, Ichimasa et al analyzed 45 clinicopathological factors in 690 endoscopically resected T1 colorectal cancer patients to predict lymph node metastasis using an SVM. This approach showed superior performance (sensitivity 100%, specificity 66%, accuracy 69%) compared to those of American (sensitivity 100%, specificity 44%, accuracy 49%), European (sensitivity 100%, specificity 0%, accuracy 9%), and Japanese (sensitivity 100%, specificity 0%, accuracy 9%) guidelines. The SVM-based prediction model also yielded a lower rate of unnecessary additional surgery attributable to misdiagnosed lymph node metastasis (77%) than prediction based on the American (85%), European (91%), and Japanese (91%) guidelines. Yang et al constructed an SVM-based model using clinicopathological features and 23 immunologic markers from 483 patients who underwent curative surgery for esophageal squamous cell carcinoma. This study revealed reasonable performance in identifying high-risk patients with postoperative distant metastasis [sensitivity 56.6%, specificity 97.7%, positive predictive value (PPV) 95.6%, negative predictive value (NPV) 72.3%, and overall accuracy 78.7%] (Table 2).
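For readers less familiar with the AUROC, it can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes it directly from that definition. The scores and labels are hypothetical and are not data from any cited study.

```python
def auroc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (event) or 0 (no event).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical predicted risks for six patients (1 = died, 0 = survived).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

value = auroc(scores, labels)  # 8 of the 9 pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.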
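The performance measures quoted throughout this section all derive from the four cells of a 2 × 2 confusion matrix. The following sketch shows the arithmetic. The cohort is hypothetical (100 patients, 9 of whom have nodal metastasis) and is not taken from Ichimasa et al. It only illustrates how a rule that flags every patient produces the pattern of 100% sensitivity, 0% specificity, and an accuracy equal to the prevalence.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }

# Hypothetical rule that recommends surgery for every patient:
# all 9 true cases are caught, and all 91 non-cases are false positives.
flag_all = diagnostic_metrics(tp=9, fp=91, fn=0, tn=0)
```

Because accuracy is dominated by the larger class, it should be read together with sensitivity and specificity whenever the classes are imbalanced.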
Table 2 Summary of clinical studies using artificial intelligence for recognition of diagnosis and prediction of prognosis.
Although endoscopic screening programs have reduced the mortality from gastrointestinal malignancies, these malignancies are still the leading cause of death worldwide and remain a global economic burden. To enhance the detection rate of gastrointestinal neoplasms and optimize treatment strategies, high-quality endoscopic examination for the recognition of gastrointestinal neoplasms and the classification of lesions as benign or malignant are essential for the gastroenterologist. Thus, gastroenterologists are interested in the applications of AI, especially CNNs and SVMs, for image analysis. Furthermore, AI has been increasingly adopted for non-neoplastic gastrointestinal diseases including infection, inflammation, and hemorrhage.
Upper gastrointestinal field: Takiyama et al constructed a CNN model that could recognize the anatomical location of EGD images with AUROCs of 1.00 for the larynx and esophagus, and 0.99 for the stomach and duodenum. This CNN model could also recognize specific anatomical locations within the stomach, with AUROCs of 0.99 for the upper, middle, and lower stomach.
To assist in the discrimination of early neoplastic lesions in Barrett’s esophagus, van der Sommen et al developed an automated algorithm to include specific textures, color filters, and ML from 100 endoscopic images. This algorithm reasonably detected early neoplastic lesions in a per-image analysis with a sensitivity and specificity of 83%. In 2017, the same group investigated a model to improve the detection rate of early neoplastic lesions in Barrett’s esophagus by using 60 ex vivo volumetric laser endomicroscopy images. This novel computer model showed optimal performance compared with a clinical volumetric laser endomicroscopy prediction score with a sensitivity of 90% and specificity of 93%.
Several studies evaluated the ML model using specialized endoscopy to differentiate neoplastic/dysplastic and non-neoplastic lesions. Kodashima et al showed that computer-based analysis can easily identify malignant tissue at the cellular level using endocytoscopic images, which enables the microscopic visualization of the mucosal surface. In 2015, Shin et al reported on an image analysis model to detect esophageal squamous dysplasia using high-resolution microendoscopy (HRME). The sensitivity and specificity of this model were 87% and 97%, respectively. During the following year, Quang et al from the same study group evolved this model, which was incorporated in tablet-interfaced HRME with full automation for real-time analysis. As a result, the model reduced the costs compared to previous laptop-interfaced HRME and showed good diagnostic yields of esophageal squamous cell carcinoma with a sensitivity and specificity of 95% and 91%, respectively. However, there was a limitation for the application of this model owing to the unavailability of specialized endoscopy.
Finally, Horie et al demonstrated the utility of AI using CNNs to make a diagnosis of esophageal cancer. This was trained with 8428 conventional endoscopic images including white-light images (WLIs) and narrow-band images (NBIs). This CNN model detected esophageal cancer with a sensitivity of 95% and could identify all small cancers of < 10 mm. This model also distinguished superficial esophageal cancer from advanced cancer with an accuracy of 98%.
Helicobacter pylori (H. pylori) infection is the most important risk factor for peptic ulcers and gastric cancer. Several researchers have applied AI to aid the endoscopic diagnosis of H. pylori infection. In 2004, Huang et al investigated the predictability of H. pylori infection by refined feature selection with a neural network using related gastric histologic features in endoscopic images. This model was trained and analyzed with 84 image parameters from 30 patients. The sensitivity and specificity for the detection of H. pylori infection were 85.4% and 90.9%, respectively. In addition, the accuracy of this model for identifying gastric atrophy and intestinal metaplasia and for predicting the severity of H. pylori-related gastric inflammation was higher than 80%.
Recently, two Japanese researchers reported on the application of a CNN to make a diagnosis of H. pylori infection[30,31]. Itoh et al developed a CNN model to recognize H. pylori infections by using 596 endoscopic images after the data augmentation of a prior set of 149 images. This CNN model showed promising results with a sensitivity and specificity of 86.7% and 86.7%, respectively. Shichijo et al compared the performance of a CNN to that of 23 endoscopists for the diagnosis of H. pylori infection by using endoscopic images. The CNN model showed superior sensitivity (88.9% vs 79.0%), specificity (87.4% vs 83.2%), accuracy (87.7% vs 82.4%), and diagnostic time (194 s vs 230 s).
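Data augmentation, as in the expansion from 149 to 596 images mentioned above, creates additional training examples by transforming existing ones. The sketch below is an assumption for illustration only: the source does not state which transformations were used, and rotation in 90-degree steps is simply one common choice that happens to yield a fourfold increase. The tiny 2 × 2 "images" are placeholders.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(images):
    """Return every image in four orientations (0, 90, 180, 270 degrees)."""
    augmented = []
    for image in images:
        current = [list(row) for row in image]
        for _ in range(4):
            augmented.append(current)
            current = rotate90(current)
    return augmented

# 149 placeholder images, each a 2 x 2 grid of hypothetical pixel values.
originals = [[[i, i + 1], [i + 2, i + 3]] for i in range(149)]

augmented = augment(originals)  # four orientations per original image
```

Augmented copies are not independent observations, so they belong in the training set only. Placing copies of the same image in both training and test sets would inflate the measured accuracy.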
In 2018, a prospective pilot study was conducted for automated diagnosis of H. pylori infections using image-enhanced endoscopy such as blue laser imaging-bright and linked color imaging. The performance of the developed AI model was significantly higher with blue laser imaging-bright and linked color imaging training (AUROCs of 0.96 and 0.95) than WLI imaging training (0.66).
The utility of AI in the diagnosis of gastrointestinal neoplasms can be classified into two main categories: detection and characterization. In 2012, Kubota et al first evaluated a computer-aided pattern recognition system to identify the depth of wall invasion of gastric cancer using endoscopic images. They used 902 endoscopic images and created a backpropagation model after 10-fold cross-validation. As a result, the diagnostic accuracy was 77.2%, 49.1%, 51.0%, and 55.3% for T1, T2, T3, and T4 staging, respectively. In particular, the accuracy of T1a (mucosal invasion) and T1b staging (submucosal invasion) was 68.9% and 63.6%, respectively. Hirasawa et al reported on the good performance of a CNN-based diagnostic system to detect gastric cancers in endoscopic images. The authors trained the CNN model using 13584 endoscopic images and tested it with 2296 images. The overall sensitivity was 92.2%. In addition, the detection rate for lesions with a diameter of 6 mm or more was 98.6%, and all invasive cancers were identified. All missed lesions were superficially depressed and differentiated-type intramucosal cancers that were difficult to distinguish from gastritis even for experienced endoscopists. However, 69.4% of the lesions that the CNN diagnosed as gastric cancer were benign, and the most common reasons for misdiagnosis were gastritis with redness, atrophy, and intestinal metaplasia.
Zhu et al further applied a CNN system to discriminate the invasion depth of gastric cancer (M/SM1 vs deeper than SM1) using conventional endoscopic images. They trained a CNN model with 790 images and tested it with another 203 images. The CNN model showed high accuracy (89.2%) and specificity (95.6%) when determining the invasion depth of gastric cancer. This result was significantly superior to that of experienced endoscopists. Kanesaka et al studied a computer-aided diagnosis system using a SVM to facilitate the use of magnifying NBI to distinguish early gastric cancer. The study reported on remarkable potential in terms of diagnostic performance (accuracy 96.3%, PPV 98.3%, sensitivity 96.7%, and specificity 95%) and the performance of area concordance (accuracy 73.8%, PPV 75.3%, sensitivity 65.5%, and specificity 80.8%).
In hepatology, AI has been applied to ultrasound. Gatos et al established an SVM diagnostic model of chronic liver disease using ultrasound shear wave elastography (70 patients with chronic liver disease and 56 healthy controls). The performance was promising, with an accuracy of 87.3%, sensitivity of 93.5%, and specificity of 81.2%, although prospective validation was not conducted. Kuppili et al established a fatty liver detection and characterization model using a single-layer feed-forward neural network, and validated this model with a higher accuracy than the previous SVM-based model. These researchers used ultrasound images of 63 patients, and the gold standard for labeling each patient was the pathologic result of a liver biopsy.
ML technology has also been applied to the determination of liver cirrhosis. Liu et al developed a CNN model with ultrasound liver capsule images (44 images from controls and 47 images from patients with cirrhosis), and classified these images using an SVM. The AUROC for the classification was 0.951, although prospective validation was not conducted.
Lower gastrointestinal field: Among various gastrointestinal fields, the development of an AI model using colonoscopy has been the most promising area because polyp detection during colonoscopies is frequent. This provides sufficient sources for AI training, and a missed colorectal polyp is directly associated with interval colorectal cancer development.
In terms of polyp detection, Fernandez-Esparrach et al established an automated computer-vision method using an energy map to detect colonic polyps in 2016. They used 24 videos containing 31 polyps and showed acceptable performance with a sensitivity of 70.4% and a specificity of 72.4% for polyp detection (Table 3). Recently, this performance was improved with a DL application for polyp detection[41,42]. Misawa et al designed a CNN model using 546 short videos from 73 full-length videos, which were divided into two groups of training data (105 polyp-positive videos and 306 polyp-negative videos) and test data (50 polyp-positive videos and 85 polyp-negative videos). The researchers showed the possibility of the automated detection of colonic polyps in real time, and the sensitivity and specificity were 90.0% and 63.3%, respectively. Urban et al also used a CNN system to identify colonic polyps. They used 8641 hand-labeled images and 20 colonoscopy videos in various combinations as training and test data. The CNN model detected polyps in real time with an AUROC of 0.991 and an accuracy of 96.4%. Moreover, it assisted in the identification of an additional nine polyps compared with expert endoscopists in the application of test colonoscopy videos.
Table 3 Summary of clinical studies using artificial intelligence in the upper gastrointestinal field.
AI: Artificial intelligence; EGD: Esophagogastroduodenoscopy; CNN: Convolutional neural network; AUROC: Area under receiver operating characteristic; SVM: Support vector machine; HRME: High-resolution microendoscopy; NBI: Narrow band image; H. pylori: Helicobacter pylori; ANN: Artificial neural network; PPV: Positive predictive value.
Although many automated polyp detection models showed promising performance, prospective validation was not conducted[43-45]. However, Klare et al performed a prototype software validation under real-time conditions (55 routine colonoscopies), and the results of the endoscopists and the established software were comparable. The endoscopists’ polyp detection rate and adenoma detection rate were 56.4% and 30.9%, respectively, and the corresponding rates for the software were 50.9% and 29.1%. Wang et al established a DL algorithm by using data from 1290 patients, and validated this model with 27113 newly collected colonoscopy images from 1138 patients. This model showed remarkable performance with a sensitivity of 94.38%, specificity of 95.2%, and AUROC of 0.984 for at least one polyp detection.
For AI applications of polyp characterization, magnifying endoscopic images, which are useful when discriminating pit or vascular patterns, were first adopted to enhance the performance of AI. Tischendorf et al developed an automated classification model of colorectal polyps using magnifying NBI images to evaluate vascular patterns in 2010. They reported that the overall accurate classification rates were 91.9% for a consensus decision between the human observers and 90.9% for a safe decision (classifying polyps as neoplastic in cases when there was an interobserver discrepancy). In 2011, Gross et al compared the performance of a computer-based model with that of endoscopists for the differentiation of small colonic polyps of < 10 mm using NBI images. The expert endoscopists and computer-based model showed comparable diagnostic performance in sensitivity (93.4% vs 95.0%), specificity (91.8% vs 90.3%), and accuracy (92.7% vs 93.1%).
Takemura et al retrospectively compared the identification of pit patterns of a computer-based model with shape descriptors such as area, perimeter, fit ellipse, or circularity in reference to endoscopic diagnosis by using magnified endoscopic images with crystal violet staining in 2010. The accuracies of the type I, II, IIIL, and IV pit patterns of colorectal lesions were 100%, 100%, 96.6%, and 96.7%, respectively. In 2012, the authors applied an upgraded version of a computer system via SVM to distinguish neoplastic and non-neoplastic lesions by using endoscopic NBI images, which showed a detection accuracy of 97.8%. They further demonstrated the availability of a real-time image recognition system in 2016, and the accuracy between the pathologic results of diminutive polyps and diagnosis by a real-time image recognition model was 93.2%.
Byrne et al developed a CNN model for the real-time differentiation of diminutive colorectal polyps by using only NBI video frames in 2017. This model discriminated adenomas from hyperplastic polyps with an accuracy of 94%, and identified adenomas with a sensitivity of 98% and a specificity of 83%. Likewise, Chen et al built a CNN model trained with 2157 images to identify neoplastic or hyperplastic polyps of < 5 mm with a PPV and NPV of 89.6% and 91.5%, respectively. In 2017, Komeda et al reported preliminary data on a CNN model to distinguish adenomas from non-adenomatous polyps. The CNN model was trained with 1800 conventional endoscopic images with WLI, NBI, and chromoendoscopy, and the accuracy of a 10-fold cross-validation was 75.1%.
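In k-fold cross-validation, the data are split into k parts. Each part serves once as the test set while the model is trained on the remainder, and the k accuracies are averaged. The sketch below shows only the splitting step for 1800 items and k = 10. It is a generic illustration, not the procedure of any cited study, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds.

    In practice the indices are usually shuffled first, and images from the
    same patient should be kept within a single fold to avoid leakage.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 1800 items split into 10 folds: 180 for testing, 1620 for training each time.
folds = list(k_fold_indices(1800, 10))
```

Every item is tested exactly once, which makes better use of a small dataset than a single split, but it does not replace external validation on independently collected data.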
To enhance the differentiation of polyps, a Japanese study group published several articles on AI applications with endocytoscopy images, which enable the observation of nuclei on site, and showed diagnostic results comparable to those of pathologic examinations. In 2015, these researchers first developed a computer-aided diagnosis system using endocytoscopy for the discrimination of neoplastic changes in small polyps. This approach showed a sensitivity (92.0%) and accuracy (89.2%) comparable with those of expert endoscopists. In 2016, this research team developed a second-generation model that could (1) evaluate both nuclei and ductal lumens, (2) use an SVM instead of multivariate analysis, (3) provide the confidence levels of the decisions, and (4) shorten the time needed to discriminate neoplastic changes from 0.3 s to 0.2 s. The endocytoscopic microvascular patterns could be effectively evaluated by staining with dye. These researchers also developed endocytoscopy with NBI without staining to evaluate microvascular findings. This approach showed an overall accuracy of 90%. The same group performed a prospective validation of a real-time computer-aided diagnosis system using endocytoscopy with NBI or stained images to identify neoplastic diminutive polyps. The researchers reported a pathologic prediction rate of 98.1%, and the time required to assess one diminutive polyp was about 35 to 47 s.
Takeda et al investigated the application of a computer-aided ultrahigh (approximately 400 ×) magnification endocytoscopy system for the diagnosis of invasive colorectal cancers. This system was trained with 5543 endocytoscopic images from 238 lesions and achieved a sensitivity of 89.4%, specificity of 98.9%, and accuracy of 94.1% on 200 test images.
For the application of AI in IBD, Maeda et al developed a diagnosis system using an SVM after refining previous computer-aided endocytoscopy systems[56-58]. They evaluated the diagnostic performance of this model for the prediction of persistent histologic inflammation in ulcerative colitis patients. This model showed good performance, with a sensitivity of 74%, a specificity of 97%, and an accuracy of 91%.
Currently, the resolution of images is relatively low in capsule endoscopy compared to other digestive endoscopies. Moreover, the interpretation and diagnosis of capsule endoscopy images depend heavily on the reviewer’s ability and effort, and the process is time-consuming. Therefore, automated diagnosis of capsule endoscopy images has been attempted for several conditions, including angiectasia, celiac disease, and intestinal hookworms, as well as for small intestinal motility characterization[62-65].
Leenhardt et al developed a gastrointestinal angiectasia detection model using semantic segmentation images with a CNN. They used 600 control images and 600 typical angiectasia images from 4166 small bowel capsule endoscopy videos, which were divided equally into training and test data sets. The CNN-based model showed high diagnostic performance, with a sensitivity of 100%, specificity of 96%, PPV of 96%, and NPV of 100% (Table 4). Zhou et al established a CNN model for distinguishing celiac disease from controls using capsule endoscopy clips from six celiac disease patients and five controls. The researchers achieved 100% sensitivity and specificity on the test data set. Moreover, the evaluation confidence was related to the severity level of small bowel mucosal lesions, reflecting the potential for quantitative measurement of the existence and degree of pathology throughout the small intestine. Intestinal hookworms are difficult to find with direct visualization because they have small tubular structures with a whitish color and semitransparent features similar to the background intestinal mucosa. Moreover, the presence of intestinal secretory materials makes them difficult to detect. He et al established a CNN model for the detection of hookworms in capsule endoscopy images. The CNN-based model showed reasonable performance, with a sensitivity of 84.6% and a specificity of 88.6%; only 15% of hookworm images and 11% of non-hookworm images were misclassified.
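As a worked illustration of how sensitivity, specificity, PPV, and NPV relate to one another, the sketch below computes them from a 2 × 2 confusion matrix. The counts are hypothetical: they assume a balanced test set of 300 angiectasia and 300 control images and were chosen only to be consistent with the percentages reported for the Leenhardt et al model. They are not taken from the original study.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased images correctly flagged
        "specificity": tn / (tn + fp),  # control images correctly cleared
        "ppv": tp / (tp + fp),          # flagged images that are truly diseased
        "npv": tn / (tn + fn),          # cleared images that are truly normal
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 300 lesion images (none missed), 300 controls (12 false alarms).
metrics = diagnostic_metrics(tp=300, fp=12, tn=288, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are computed within the diseased and control groups, respectively, whereas PPV and NPV are computed across groups and therefore depend on the proportion of diseased images in the test set.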
Table 4 Summary of clinical studies using artificial intelligence in the lower gastrointestinal field.
Training set: 5545 images from 1290 patients. Validation set A: 27113 images from 1138 patients. Validation set B: 612 images. Validation set C: 138 video clips from 110 patients. Validation set D: 54 videos from 54 patients
Dataset A: AUROC: 0.98 for at least one polyp detection, per-image sensitivity: 94.4%, per-image specificity: 95.2%. Dataset B: per-image sensitivity: 88.2%. Dataset C: per-image sensitivity: 91.6%, per-polyp sensitivity: 100%. Dataset D: per-image specificity: 95.4%
Prediction of persistent histologic inflammation in ulcerative colitis patients
Training set: 12900 images. Test set: 9935 images
Endocytoscopy with NBI
Accuracy: 91%, Sensitivity: 74%, Specificity: 97%
AI: Artificial intelligence; CNN: Convolutional neural network; NBI: Narrow band imaging; AUROC: Area under the receiver operating characteristic curve; SVM: Support vector machine; PPV: Positive predictive value; NPV: Negative predictive value.
The interpretation of wireless motility capsule endoscopy is a complex task. Seguí et al established a CNN model for small-intestine motility characterization and achieved a mean classification accuracy of 96% for six intestinal motility events (“turbid”, “bubbles”, “clear blob”, “wrinkles”, “wall”, and “undefined”). This model outperformed the other classifiers by a large margin (a 14% relative performance increase).
CHALLENGES AND FUTURE DIRECTIONS FOR APPLICATION OF AI
Although many researchers have investigated the utility of AI and have shown promising results, most studies were designed retrospectively, either as case-control studies from a single center or using endoscopic images chosen from specific endoscopic modalities that are unavailable at many institutions. Inherent biases, such as selection bias, cannot be excluded in this situation. Therefore, it is crucial to meticulously validate the performance of AI before applying it in real clinical practice. To properly verify the accuracy of AI, physicians should understand how overfitting and spectrum bias (class imbalance) affect the performance of AI and should evaluate performance in ways that avoid these biases.
Overfitting occurs when a learning model tailors itself too closely to the training dataset, so that its predictions do not generalize well to new datasets (Table 5). Although several methods have been used to reduce overfitting in the development of DL models, they do not guarantee that the problem is resolved. In addition, datasets collected with a case-control design are particularly vulnerable to spectrum bias. Spectrum bias occurs when the dataset used for model development does not adequately represent the range of patients to whom the model will be applied in clinical practice (target population).
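One consequence of spectrum bias can be shown with a short calculation. The sketch below applies Bayes' rule to show how the PPV of a model with fixed sensitivity and specificity falls when the disease prevalence in the target population is lower than in a balanced case-control dataset. The sensitivity and specificity values are illustrative, and the 5% prevalence is an assumed figure that does not come from any cited study.

```python
def ppv_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value for a given disease prevalence (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Illustrative operating point: sensitivity 100%, specificity 96%.
ppv_case_control = ppv_at_prevalence(1.0, 0.96, prevalence=0.50)  # balanced dataset
ppv_screening = ppv_at_prevalence(1.0, 0.96, prevalence=0.05)     # assumed 5% prevalence

print(f"PPV at 50% prevalence: {ppv_case_control:.1%}")
print(f"PPV at 5% prevalence:  {ppv_screening:.1%}")
```

Under these assumptions, the same model that appears highly reliable on a balanced dataset yields a markedly lower PPV in a low-prevalence population, even though its sensitivity and specificity are unchanged.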
Table 5 Summary of clinical studies using artificial intelligence in capsule endoscopy.
Because overfitting and spectrum bias may lead to overestimation of accuracy and generalizability, external validation using datasets that were not used for model development, collected in a way that minimizes spectrum bias, is mandatory. For more robust clinical verification, well-designed multicenter prospective studies with adequate inclusion/exclusion criteria that represent the target population are needed. Furthermore, DL technology has a “black box” nature (lack of interpretability or explainability), which means that the decision mechanism of AI is not clearly demonstrated (Figure 2). Because interpretability can provide safety measures, help to detect bias, and establish social acceptance, further investigation is needed to solve this issue. However, some methods, such as the attention map and saliency region, have been proposed to complement the “black box” characteristics.
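To give a sense of how a saliency-style map can be produced, the sketch below uses occlusion sensitivity: each image patch is masked in turn, and the drop in the model score is recorded. This is one simple technique among several and is not necessarily the method used in the cited work. The `toy_score` function is a hypothetical stand-in for a trained CNN that looks only at the top-left corner of the image.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_map(image: Image, score: Callable[[Image], float],
                  patch: int = 2, fill: float = 0.0) -> Image:
    """Score drop observed when each patch x patch block is replaced by `fill`."""
    height, width = len(image), len(image[0])
    baseline = score(image)
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - score(occluded)
            for r in rows:
                for c in cols:
                    heat[r][c] = drop  # large drop = region the model relies on
    return heat


def toy_score(image: Image) -> float:
    """Hypothetical model: mean brightness of the top-left 2 x 2 block."""
    return (image[0][0] + image[0][1] + image[1][0] + image[1][1]) / 4.0


toy_image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(toy_image, toy_score)
```

In this toy case the map is non-zero only over the top-left block, which is the only region the stand-in model uses. With a real model, such a map lets a reviewer check whether the highlighted region corresponds to the lesion rather than to an artifact.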
Figure 2 Interpretability-accuracy tradeoff in classification algorithms of machine learning.
The efficiency and accuracy of ML increase as the amount of data increases; however, it is challenging to develop an efficient ML model owing to the paucity of human-labeled data, given the privacy issues surrounding private medical records. To overcome this issue, data augmentation strategies (with synthetically modified data) have been proposed. Spiking neural networks, which more closely mimic the real mechanisms of neurons, could potentially replace current ANN models with more powerful computing ability, although no effective supervised learning method currently exists.
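A minimal sketch of data augmentation is given below. It generates flipped and rotated copies of an image, represented here as a nested list of pixel values. Which transformations are appropriate depends on the imaging modality and the task, and the source does not specify any particular set. The function names are hypothetical.

```python
from typing import List

Image = List[List[int]]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original image plus five label-preserving variants."""
    variants = [img, flip_horizontal(img), flip_vertical(img)]
    rotated = img
    for _ in range(3):  # 90, 180, and 270 degree rotations
        rotated = rotate_90(rotated)
        variants.append(rotated)
    return variants


tile = [[1, 2],
        [3, 4]]
augmented = augment(tile)
```

Each labeled image yields six training samples here without any additional manual annotation, on the assumption that the diagnosis does not depend on image orientation.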
The precision of diagnosis or classification using AI does not always translate into efficacy in real clinical practice. The actual benefit in clinical outcomes, the satisfaction of physicians, and the cost-effectiveness beyond academic performance must be proven by rigorous investigation. Finally, reasonable regulations from the responsible authorities and a reimbursement policy are essential for integrating AI technology into the current healthcare environment. Moreover, AI is not perfect. For this reason, the term “Augmented Intelligence” emerged, emphasizing that AI is designed to improve or enhance human intelligence rather than replace it. Although the aim of applying AI in medical practice is to improve the workflow with enhanced precision and to reduce the number of unintentional errors, established models with inaccurate or exaggerated performance are likely to cause ethical issues owing to misdiagnosis or misclassification. Moreover, we do not know the impact of AI application on the doctor-patient relationship, which is an essential part of healthcare utilization and the practice of medicine. Therefore, ethical principles relevant to AI model development should be established now, while AI research is beginning to increase.
Since AI was introduced in the 1950s, it has been persistently applied to statistical and image analyses in the field of gastroenterology. The recent evolution of big data and computer science has enabled the dramatic development of AI technology, particularly DL, which has shown promising potential. There is now no doubt that the implementation of AI in the gastroenterology field will progress across various healthcare services. To utilize AI wisely, physicians should make a great effort to understand its feasibility and to ameliorate its drawbacks through further investigation.
Manuscript source: Invited manuscript
Specialty type: Gastroenterology and hepatology
Country of origin: South Korea
Peer-review report classification
Grade A (Excellent): 0
Grade B (Very good): 0
Grade C (Good): C, C
Grade D (Fair): 0
Grade E (Poor): 0
P-Reviewer: Chiu KW, Triantafyllou K S-Editor: Ma RY L-Editor: A E-Editor: Song H
Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on Computational learning theory; 1992 July 1, Pittsburgh, US. New York: ACM 1992; 144-152.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer 2001.
McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5:115-133.
Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958;65:386-408.
Pace F, Buscema M, Dominici P, Intraligi M, Baldi F, Cestari R, Passaretti S, Bianchi Porro G, Grossi E. Artificial neural networks are able to recognize gastro-oesophageal reflux disease patients solely on the basis of clinical data. Eur J Gastroenterol Hepatol. 2005;17:605-610.
Lahner E, Grossi E, Intraligi M, Buscema M, Corleto VD, Delle Fave G, Annibale B. Possible contribution of artificial neural networks and linear discriminant analysis in recognition of patients with suspected atrophic body gastritis. World J Gastroenterol. 2005;11:5867-5873.
Pofahl WE, Walczak SM, Rhone E, Izenberg SD. Use of an artificial neural network to predict length of stay in acute pancreatitis. Am Surg. 1998;64:868-872.