Systematic Reviews
Copyright ©The Author(s) 2025.
World J Gastrointest Surg. Aug 27, 2025; 17(8): 109463
Published online Aug 27, 2025. doi: 10.4240/wjgs.v17.i8.109463
Table 1 The 45 studies’ journals included in this systematic review and the number of publications per journal
| Journal title | Number of publications |
|---|---|
| Surgical Endoscopy and Other Interventional Techniques | 14 |
| Endoscopy | 11 |
| Digestive Endoscopy | 7 |
| Annals of Surgery | 1 |
| ANZ Journal of Surgery | 1 |
| Cirugía y Cirujanos | 1 |
| Colorectal Disease | 1 |
| Diseases of the Colon & Rectum | 1 |
| International Journal of Computer Assisted Radiology and Surgery | 1 |
| International Journal of Surgery | 1 |
| Journal of Gastrointestinal Surgery | 1 |
| Journal of Surgical Research | 1 |
| Obesity Surgery | 1 |
| Surgery | 1 |
| Surgical Laparoscopy Endoscopy & Percutaneous Techniques | 1 |
Table 2 Overview of studies focusing on preoperative planning and risk prediction in gastrointestinal surgery
| Ref. | Focus area | AI method/model | Key outcome | Number of data points | Results |
|---|---|---|---|---|---|
| Galvis-García et al[13], 2023 | Colorectal polyp detection and classification | Deep learning, CNN, CAD (CADe/CADx) | Increased adenoma and polyp detection rates using AI-assisted colonoscopy in real time | 1038 patients (RCT), 8641 images (CNN study), 466 polyps (CADx), 238 lesions (endocytoscopy) | Sensitivity up to 96.5%, specificity up to 93%, accuracy up to 96.4%, F1 score approximately 94%, NPV up to 99.6% |
| Zhang et al[14], 2023 | Diagnosis of choledocholithiasis in gallstone patients | Machine learning (7 models), AI (ModelArts) | Developed and validated an AI model with high diagnostic accuracy for CBD stone prediction | 1199 patients (681 with CBD stones) | ModelArts AI: accuracy 0.97, recall 0.97, precision 0.971, F1 score 0.97; machine learning AUCs: 0.77-0.81 |
| Ahmad et al[15], 2022 | Detection of subtle and advanced colorectal neoplasia | Deep learning (ResNet-101 CNN) | High sensitivity for detecting flat lesions, sessile serrated lesions, and advanced colorectal polyps | 173 polyps, 35114 polyp-positive frames, 634988 polyp-negative frames across multiple datasets | Per-polyp sensitivity: 100%, 98.9%, and 79.5% (subtle set); F1 score: up to 87.9%; CNN outperformed expert and trainee endoscopists in detection speed and accuracy |
| Lei et al[16], 2023 | Polyp detection via colon capsule endoscopy | Deep learning (CNN, AiSPEED) | Feasibility and diagnostic accuracy of AI-assisted reading compared to clinician interpretation | Target: 674 patients (597 needed for power; both prospective and retrospective recruitment) | Sensitivity/specificity of AI to be compared against the clinician standard; exact results pending as the study is ongoing |
| Eckhoff et al[17], 2023 | Surgical phase recognition for Ivor-Lewis esophagectomy | CNN + LSTM (TEsoNet), transfer learning | Demonstrated feasibility of knowledge transfer from sleeve gastrectomy to esophagectomy with moderate accuracy | 60 sleeve gastrectomy videos, 40 esophagectomy videos (used in combinations across 5 experiments) | Single-procedure accuracy: 87.7% (sleeve); transfer learning: 23.4% overall (4 overlapping phases: 58.6%); co-training max accuracy: 40.8% |
| Blum et al[18], 2024 | Prediction of choledocholithiasis | Logistic regression, RF, XGBoost, KNN; ensemble model | Machine learning models can outperform ASGE guidelines in predicting choledocholithiasis risk using pre-MRCP data | 222 patients | AUROC: 0.83 (RF), accuracy: 0.81 (ensemble), sensitivity: up to 0.94, F1 score: up to 0.82 |
| Axon[19], 2020 | Evolution and future of digestive endoscopy | Conceptual AI (future prediction) | Highlights AI’s potential to surpass expert-level diagnostic accuracy and support real-time treatment decisions | | Predicts that AI will revolutionize diagnosis and treatment in endoscopy with real-time support tools |
| Hsu et al[20], 2023 | Predicting postoperative GIB after bariatric surgery | RF, XGBoost, NNs | Machine learning models outperform logistic regression in predicting GIB, aiding clinical decision making | 159959 patients (632 with GIB) | RF AUROC: 0.764, sensitivity: 75.4%, specificity: 70.0%; XGBoost AUROC: 0.746; NN AUROC: 0.741; logistic regression AUROC: 0.709 |
| Athanasiadis et al[21], 2025 | Accuracy of self-assessment in laparoscopic cholecystectomy (CVS quality) | | Surgeons frequently overestimate CVS performance; self-assessment alone is insufficient | 25 surgeons enrolled; 13 submitted 1 video, 4 submitted 2 videos | No surgeon achieved adequate CVS per expert review; significant discrepancy between self and expert ratings on the Strasberg scale |
| Bisschops et al[22], 2019 | Advanced endoscopic imaging for colorectal neoplasia | | Guidance on when to use advanced imaging (HD-WLE, CE, NBI, etc.) for detection and differentiation of colorectal lesions | | AI suggested for future use if validated; various imaging techniques reviewed for ADR, miss rates, lesion detection |
| Han et al[23], 2025 | Intraoperative recognition of PAN during total mesorectal excision | DeepLabv3+ with ResNet50 backbone | AI model (AINS) achieved real-time neurorecognition of PAN, aiding nerve preservation during rectal cancer surgery | 1780 images (1424 training, 356 validation) | Accuracy: 0.9609, precision: 0.7494, recall: 0.6587, F1 score: 0.7011; AI outperformed surgeons (F1: 0.4568) and operated faster (3 minutes vs 25 minutes) |
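Several rows above quote sensitivity, specificity, precision, NPV, accuracy, and F1 score. All of these derive from the same 2 × 2 confusion matrix; a minimal Python sketch (the counts are hypothetical, not taken from any study in this table) makes the definitions explicit:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)           # recall / true-positive rate
    specificity = tn / (tn + fp)           # true-negative rate
    precision = tp / (tp + fp)             # positive predictive value (PPV)
    npv = tn / (tn + fn)                   # negative predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "npv": npv,
            "accuracy": accuracy, "f1": f1}

# Hypothetical counts for a lesion-detection model evaluated on 200 cases:
m = confusion_metrics(tp=95, fp=7, fn=5, tn=93)
```

Note that F1 is the harmonic mean of precision and sensitivity only; a model can post a high F1 on a lesion-rich test set while specificity tells a different story, which is why the studies above report both.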
Table 3 Summary of studies related to artificial intelligence-based intraoperative guidance in gastrointestinal surgery
| Ref. | Surgical context | AI method/model | Guidance function | Data type/source | Real-time capability | Performance metrics |
|---|---|---|---|---|---|---|
| Sato et al[24], 2022 | Thoracoscopic esophagectomy | DeepLabv3+ (CNN-based semantic segmentation) | Recurrent laryngeal nerve identification and navigation | 3000 annotated intraoperative images from 20 videos (train/validation) + 40 test images from 8 other videos | Yes | Dice: AI 0.58, expert 0.62, general surgeons 0.47 |
| Niikura et al[25], 2022 | Upper GI endoscopy | Single-shot multibox detector (CNN) | Detection of gastric cancer in endoscopic images | 23892 white-light images from 500 patients (1:1 matched to expert endoscopists) | No | Per patient: 100%, per image: 99.87%, IOU: 0.842 |
| Yang et al[26], 2022 | EUS for subepithelial lesions | ResNet-50 (CNN-based deep learning) | Differentiation of gastrointestinal stromal tumors and leiomyomas using EUS | 10439 EUS images from 752 patients (multicenter); 132 prospective patients for clinical validation | Yes | AUC: 0.986 (internal), 0.642 (external); accuracy: 96.2% |
| Schnelldorfer et al[27], 2024 | Staging laparoscopy | YOLOv5 (detection), ensemble ResNet18 CNN (classification) | Identification and classification of peritoneal surface metastases | 4287 lesions from 132 patients (365 biopsied lesions; 3650 image patches) | No | AUC-PR: 0.69, AUROC: 0.78, accuracy: 78% |
| Guo et al[28], 2021 | GI endoscopy (multi-lesion) | Deep CNN (ResNet-50 with TTA and transfer learning) | Detecting four lesion categories to support endoscopic diagnosis | 327121 WLI images for training; 33959 validation images from 1734 cases | No (0.05 seconds/image) | Sensitivity: 88.3%, specificity: 90.3%, accuracy: 89.7% |
| Houwen et al[29], 2022 | Colonoscopy | | Defines competence standards (SODA) for AI or endoscopists in optical diagnosis | Simulation studies + systematic literature review; ESGE Delphi consensus panel | Yes | Sensitivity ≥ 90%, specificity ≥ 80% (leave in situ); ≥ 80% both (resect and discard) |
| Tatar and Çubukçu[30], 2024 | Colonoscopy | YOLOv8-based CNN (ColoNet) | Identification of neoplastic/premalignant/malignant lesions; biopsy decision support | 1760 colonoscopy images (306 patients) for training/validation; 91 external images for real-time testing | Yes | mAP50: 0.832, accuracy: 82.4%, sensitivity: 70.7%, specificity: 92.0% |
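The Dice score reported by Sato et al[24] and the IOU reported by Niikura et al[25] are both overlap measures between a predicted region and the ground-truth annotation. A small sketch over toy binary masks (the pixel sets are invented for illustration) shows the two computations side by side:

```python
def dice_iou(mask_a, mask_b):
    """Dice coefficient and IoU for two binary masks given as sets of pixel coordinates."""
    inter = len(mask_a & mask_b)            # overlapping pixels
    union = len(mask_a | mask_b)            # pixels in either mask
    dice = 2 * inter / (len(mask_a) + len(mask_b))
    iou = inter / union
    return dice, iou

# Toy masks on one image row: prediction overlaps 6 of 8 ground-truth pixels
gt   = {(0, i) for i in range(8)}
pred = {(0, i) for i in range(2, 10)}
dice, iou = dice_iou(gt, pred)
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so Dice is always at least as large as IoU for the same masks; they are not directly comparable across the studies above.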
Table 4 Artificial intelligence applications for postoperative monitoring and complication prediction in gastrointestinal care
| Ref. | Complication targeted | AI method/model | Timing of prediction | Data source/type | Performance metrics | Clinical utility |
|---|---|---|---|---|---|---|
| van de Sande et al[31], 2022 | Need for hospital-specific interventions (e.g., reoperation, radiological intervention, IV antibiotics) | Random forest (4 variants tested) | After second postoperative day | EHR data from 3 non-academic hospitals; 18 perioperative variables (e.g., age, BMI, ASA, medications, surgery time) | AUROC: 0.83, sensitivity: 77.9%, specificity: 79.2%, PPV: 61.6%, NPV: 89.3% | Supports early safe discharge and capacity management |
| Choi et al[32], 2024 | Missed small bowel lesions in negative capsule endoscopy | CNN | After initial human reading of SBCE videos | 103 negative SBCE videos retrospectively reanalyzed; images from two academic hospitals | CNN detected additional lesions in 61.2% of cases; model had > 96% accuracy in a prior study (AUROC = 0.9957) | Reduces diagnostic oversight; changed diagnosis in 10.3% |
| Blum et al[18], 2024 | Choledocholithiasis | Logistic regression, random forest, XGBoost, KNN, ensemble | Before MRCP, using pre-intervention data | Retrospective data from 222 patients (clinical, biochemical, imaging variables) from Royal Hobart Hospital | Ensemble and random forest models: accuracy 0.81, AUROC 0.83, sensitivity 0.94, specificity 0.69, F1 0.82 | Avoids unnecessary MRCP; triages patients for ERCP |
| Haak et al[33], 2022 | Incomplete response after CRT in rectal cancer | CNNs (EfficientNet-B2, Xception, etc.); FFN for clinical data; combined model | After chemoradiation, before surgery | 226 patients; 731 endoscopic images; clinical features from a single-institute retrospective cohort | EfficientNet-B2: AUC 0.83, accuracy 0.75, sensitivity 0.77, specificity 0.75, PPV 0.74, NPV 0.77 | Identifies candidates for non-surgical follow-up |
| Noar and Khan[34], 2023 | GP | AI-derived GMAT threshold using multivariate regression | Pre-treatment (based on GMA from EGG + WLST) | 30 patients with GP; GMA via EGG; WLST; gastric emptying tests; symptom scores (GCSI-DD, Leeds) | Sensitivity: 96%, specificity: 75%, accuracy: 93%, AUC: 0.95 for GMAT ≥ 0.59 | Guides selection for balloon dilation; personalized therapy |
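AUROC, quoted throughout this table, has a direct probabilistic reading: it is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case (the Mann-Whitney formulation, with ties counted as half). A minimal sketch with invented scores:

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0          # positive correctly ranked above negative
            elif p == n:
                wins += 0.5          # ties count half
    return wins / (len(pos_scores) * len(neg_scores))

# Invented risk scores: 4 patients with the complication, 5 without
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2, 0.1]
a = auroc(pos, neg)
```

This pairwise form is quadratic and only meant to show the definition; production code would use a rank-based or trapezoidal ROC implementation. It also makes clear why an AUROC of 0.83, as in Blum et al[18], means the model ranks a true stone case above a stone-free case 83% of the time, independent of any decision threshold.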
Table 5 Artificial intelligence applications in endoscopic diagnosis and quality control for gastrointestinal cancers
| Ref. | Cancer type/lesion | Endoscopic modality | AI method/model | Task/objective | Dataset size/source | Performance metrics | Clinical relevance/impact |
|---|---|---|---|---|---|---|---|
| Messmann et al[35], 2022 | BERN | Upper GI endoscopy | AI-assisted deep learning systems | Real-time detection and localization of Barrett’s neoplasia | Meta-analyses and real-time studies (n > 1000 images) | Sensitivity: 83.7%-95.4%, accuracy: 88%-96% | Improves detection of subtle lesions; supports targeted biopsies over the Seattle protocol |
| Choi et al[36], 2022 | | EGD | CNN (squeeze-and-excitation network) | Classification of anatomical landmarks and completeness of photodocumentation | 2599 images from 250 EGD procedures (Korea University Hospital) | Landmark classification: accuracy 97.58%, sensitivity 97.42%, specificity 99.66%; completeness detection: accuracy 89.20%, specificity 100% | Enhances quality control in EGD by automatically verifying complete anatomical documentation |
| Inaba et al[37], 2024 | | Colonoscopy (preparation phase) | MobileNetV3-based CNN (smartphone app) | AI-based stool image classification to assess bowel preparation quality | 1689 images from 121 patients; prospective validation in 106 patients | Accuracy: 90.2% (grade 1), 65.0% (grade 2), 89.3% (grade 3); BBPS ≥ 6 in 99.0% of app users | Improved bowel preparation monitoring; 100% cecal intubation; reduced burden on patients and nurses |
| Zhang et al[14], 2023 | Suspected choledocholithiasis | Not applicable (pre-endoscopy prediction) | ModelArts AI platform (Huawei); 7 machine learning models also tested | Predictive classification of CBD stones before cholecystectomy | 1199 patients with symptomatic gallstones; retrospective, single center | ModelArts AI: accuracy 0.97, recall 0.97, precision 0.971, F1 score 0.97 | May outperform guideline-based risk stratification; reduces unnecessary ERCP |
| Wu et al[38], 2021 | EGC | EGD | ENDOANGEL system (CNNs + deep reinforcement learning) | Real-time monitoring of blind spots and detection of EGC | 1050 patients in a multicenter RCT; 196 gastric lesions biopsied | Accuracy: 84.7%, sensitivity: 100%, specificity: 84.3% | Reduced blind spots, improved EGD quality, potential for real-time EGC detection in the clinical setting |
| Rondonotti et al[39], 2023 | DRSPs ≤ 5 mm | Colonoscopy with blue-light imaging | CAD EYE (Fujifilm, Tokyo, Japan), CNN-based real-time system | Optical diagnosis to support the “resect and discard” strategy | 596 DRSPs in 389 patients; 4-center prospective study (Italy) | NPV: 91.0%, sensitivity: 88.6%, specificity: 88.1%, accuracy: 88.4% | Meets ASGE PIVI thresholds; may enable safe omission of histology in DRSPs, especially beneficial for nonexperts |
| Koh et al[40], 2023 | Colonic adenomas including SSA | Colonoscopy | GI Genius (CADe system, Medtronic, MN, United States) | Real-time detection of colonic polyps and ADR improvement | 298 colonoscopies; 487 AI “hits”; 250 polyps removed | Post-AI ADR: 30.4% vs baseline 24.3% (P = 0.02); SSA rate: 5.6% | Enhanced ADR even in experienced endoscopists; improved SSA detection; supports AI use in routine colonoscopy |
| Yuan et al[41], 2022 | Gastric lesions (EGC, AGC, SMT, polyp, PU, erosion) | White-light endoscopy | YOLO-based DCNN model | Multiclass diagnosis of six gastric lesions + lesion-free mucosa | 31388 images (29809 train/1579 test) from 9443 patients | Overall accuracy: 85.7%; EGC: sensitivity 59.2%, specificity 99.3%; AGC: sensitivity 100%, specificity 98.1% | Comparable to senior endoscopists; improved diagnostic accuracy and efficiency; potential for real-time support in diverse gastric lesion detection |
| Munir et al[42], 2024 | Not applicable (survey-based assessment) | | ChatGPT | Evaluation of AI responses to perioperative GI surgery questions | 1080 responses assessed by 45 surgeons | Majority graded “fair” or “good” (57.6%); highest “very good/excellent” rate for cholecystectomy (45.3%) | ChatGPT may aid patient education, but only 20% deemed it accurate; limited utility in reducing message load |
| Sudarevic et al[43], 2023 | Colorectal polyps | Colonoscopy | Poseidon system (EndoMind + waterjet-based AI) | AI-based in situ measurement of polyp size using the waterjet as a reference | 28 polyps in a silicone model + 29 polyps in routine colonoscopies | Median error: Poseidon 7.4% (model), 7.7% (clinical); visual: 25.1%/22.1%; forceps: 20.0% | Significantly improved sizing accuracy; requires no additional tools; useful for polyp surveillance and resection decisions |
| Tsuboi et al[44], 2020 | Small bowel angioectasia | Capsule endoscopy (PillCam SB2/SB3, Medtronic, MN, United States) | CNN (single-shot multibox detector) | Automatic detection of angioectasia in CE images | 2237 training images; 10488 validation images (488 angioectasia, 10000 normal) | AUC: 0.998; sensitivity: 98.8%, specificity: 98.4%, PPV: 75.4%, NPV: 99.9% | Enables high-accuracy detection of angioectasia; may reduce oversight and physician workload during capsule reading |
| Chang et al[45], 2022 | | Upper GI endoscopy (EGD) | ResNeSt deep learning model | Evaluate photodocumentation completeness via anatomical classification | 15305 training images; 15723 test images from 472 EGD cases | Accuracy: 96.64% (deep learning model); photodocumentation rate: 78% (esophagus-duodenum), 53.8% (pharynx-duodenum) | Enables automated auditing of image completeness; higher completeness linked to higher ADR; applicable to routine EGD quality control |
| Hwang et al[46], 2021 | Small bowel hemorrhagic and ulcerative lesions | CE | VGGNet-based CNN + Grad-CAM | Classification and localization of hemorrhagic vs ulcerative lesions | 30224 abnormal + 30224 normal images (train); 5760 images (validation) | Combined model: accuracy 96.83%, sensitivity 97.61%, specificity 96.04%, AUROC approximately 0.996 | Enhanced lesion localization without manual annotation; Grad-CAM improves interpretability; supports efficient clinical CE analysis |
| Jazi et al[47], 2023 | Not applicable (survey + clinical scenarios) | | ChatGPT-4 (LLM by OpenAI) | Assess alignment of ChatGPT-4 with expert opinions on bariatric surgery suitability and recommendations | 10 patient scenarios; 30 international bariatric surgeons | Expert match: 30%; ChatGPT-4 inconsistency: 40%; recommended surgery in 60% vs experts’ 90% | ChatGPT-4 showed limited alignment and inconsistency; suitable for education but not yet reliable for clinical decision making |
| Meinikheim et al[48], 2024 | BERN | Upper GI endoscopy (video-based) | DeepLabV3+ with ResNet50 backbone (clinical decision support system) | Evaluate the add-on effect of AI on endoscopist performance in BERN detection | 96 videos from 72 patients; 51273 training images; 22 endoscopists from 12 centers | AI alone: sensitivity 92.2%, specificity 68.9%, accuracy 81.3%; nonexperts with AI: sensitivity up from 69.8% to 78.0%, specificity up from 67.3% to 72.7% | AI significantly improved nonexperts’ diagnostic performance and confidence; accuracy comparable to experts; highlights human-AI interaction dynamics |
| Ahmad et al[49], 2021 | | Colonoscopy | | Identify top research priorities for AI implementation in colonoscopy | 15 international experts from 9 countries; 3 Delphi rounds | Not performance-focused; methodology scores used for consensus | Provides a structured framework to guide future AI implementation research in colonoscopy; emphasizes clinical trial design, data annotation, integration, and regulation |
| Lazaridis et al[50], 2021 | | CE | | Assess adherence to ESGE guidelines and future perspectives on CE use | 217 respondents from 47 countries via ESGE survey | Not model-based; survey: 91% performed CE with an appropriate indication; 84.1% classified findings as relevant/irrelevant | Highlights variation in guideline adherence; AI identified as top development priority (56.2%); suggests need for standardization and formal CE training |
| Tian et al[51], 2024 | | EUS | CNN with attention module | Automatic identification of 14 standard BPS anatomical sites on EUS | 6230 training images (1812 patients); internal: 1569 images (47 patients); external: 85322 images (131 patients from 16 centers) | Sensitivity: 89.45%-99.92%, specificity: 93.35%-99.79%, accuracy (internal): 92.1%-100%, kappa: 0.84-0.98 | Outperforms beginners, comparable to experts; enables efficient, high-quality anatomical identification in EUS; potential for training and standardization |
| He et al[52], 2020 | | Upper GI endoscopy (EGD) | CNN models (DenseNet-121, ResNet-50, VGG, etc.) | Automated classification of 11 anatomical sites for quality control and reporting | 3704 images from 211 routine EGD cases (Tianjin Medical University Hospital, Tianjin, China) | DenseNet-121: accuracy approximately 91.11%, F1 scores up to 94.92% for specific sites | Supports automated quality assurance in EGD via accurate site classification; aids report generation and completeness verification |
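The gap between NPV (99.9%) and PPV (75.4%) in Tsuboi et al[44], despite nearly equal sensitivity and specificity, is largely a prevalence effect: only 488 of 10488 validation images contained angioectasia. Plugging the reported sensitivity, specificity, and prevalence into Bayes’ rule approximately reproduces both predictive values, as this sketch shows:

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from sensitivity,
    specificity, and prevalence via Bayes' rule."""
    tp = sensitivity * prevalence              # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)

# Figures reported by Tsuboi et al[44]: sensitivity 98.8%, specificity 98.4%,
# 488 angioectasia frames among 10488 validation images
ppv, npv = ppv_npv(0.988, 0.984, 488 / 10488)
```

This is why tables in this review report predictive values alongside sensitivity/specificity: PPV and NPV depend on the lesion prevalence of the test set and do not transfer directly to populations with different prevalence.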
Table 6 Artificial intelligence based and robotics enhanced surgical education and performance assessment
| Ref. | Surgical procedure/task | AI method/model | Assessment modality | Data type/source | Performance metrics | Educational outcome |
|---|---|---|---|---|---|---|
| Garfinkle et al[53], 2022 | Gastrointestinal and endoscopic surgery (priority setting) | | Survey/Delphi | Survey data from SAGES member surgeons | | Identified core needs: video training, tech adoption |
| Huo et al[54], 2024 | Surgical decision making for GERD | ChatGPT-3.5/4, Copilot, Google Bard, Perplexity AI | Prompt-based guideline comparison | Standardized clinical vignettes based on SAGES guidelines | Surgeons (accuracy): Bard 6/7 (85.7%), ChatGPT-4 5/7 (71.4%); patients (accuracy): Bard 4/5 (80.0%), ChatGPT-4 3/5 (60.0%); children: Copilot and Bard 3/3 (100.0%) | Revealed inconsistencies in LLM advice; need for medical-domain training |
| Huo et al[55], 2024 | Surgical management of GERD | Generic ChatGPT-4 vs customized GPT (GTS) | Prompt-based guideline comparison | 60 surgeon cases and 40 patient cases based on SAGES and UEG-EAES guidelines | GTS (custom GPT): 100% accuracy for both surgeons (60/60) and patients (40/40); generic GPT-4: 66.7% (40/60) for surgeons, 47.5% (19/40) for patients | Demonstrated the impact of domain customization in LLMs |
| Nasir et al[56], 2021 | Robotic rectal cancer surgery | No AI model used (robotics only) | | RCTs, observational studies, registry data | Reduced conversion to open surgery (especially in obese/male patients); improved urogenital function; no difference in long-term oncologic outcomes | Reinforced need for structured robotic training (e.g., EARCS) |