Copyright
©The Author(s) 2025.
World J Gastrointest Surg. Aug 27, 2025; 17(8): 109463
Published online Aug 27, 2025. doi: 10.4240/wjgs.v17.i8.109463
Table 1 Journals of the 45 studies included in this systematic review and the number of publications per journal
Journal title | Number of publications |
Surgical Endoscopy and Other Interventional Techniques | 14 |
Endoscopy | 11 |
Digestive Endoscopy | 7 |
Annals of Surgery | 1 |
ANZ Journal of Surgery | 1 |
Cirugía y Cirujanos | 1 |
Colorectal Disease | 1 |
Diseases of the Colon & Rectum | 1 |
International Journal of Computer Assisted Radiology and Surgery | 1 |
International Journal of Surgery | 1 |
Journal of Gastrointestinal Surgery | 1 |
Journal of Surgical Research | 1 |
Obesity Surgery | 1 |
Surgery | 1 |
Surgical Laparoscopy Endoscopy & Percutaneous Techniques | 1 |
Table 2 Overview of studies focusing on preoperative planning and risk prediction in gastrointestinal surgery
Ref. | Focus area | AI method/model | Key outcome | Number of data points | Results |
Galvis-García et al[13], 2023 | Colorectal polyp detection and classification | Deep learning, CNN, CAD (CADe/CADx) | Increased adenoma and polyp detection rates using AI assisted colonoscopy in real time | 1038 patients (RCT), 8641 images (CNN study), 466 polyps (CADx), 238 lesions (endocytoscopy) | Sensitivity up to 96.5%, specificity up to 93%, accuracy up to 96.4%, F1 score approximately 94%, NPV up to 99.6% |
Zhang et al[14], 2023 | Diagnosis of choledocholithiasis in gallstone patients | Machine learning (7 models), AI (ModelArts) | Developed and validated AI model with high diagnostic accuracy for CBD stones prediction | 1199 patients (681 with CBD stones) | ModelArts AI: Accuracy 0.97, recall 0.97, precision 0.971, F1 score 0.97; machine learning AUCs: 0.77-0.81 |
Ahmad et al[15], 2022 | Detection of subtle and advanced colorectal neoplasia | Deep learning (ResNet 101 CNN) | High sensitivity for detecting flat lesions, sessile serrated lesions, and advanced colorectal polyps | 173 polyps, 35114 polyp positive frames, 634988 polyp negative frames across multiple datasets | Per polyp sensitivity: 100%, 98.9%, and 79.5% (in subtle set); F1 score: Up to 87.9%; CNN outperformed expert and trainee endoscopists in detection speed and accuracy |
Lei et al[16], 2023 | Polyp detection via colon capsule endoscopy | Deep learning (CNN, AiSPEED™) | Feasibility and diagnostic accuracy of AI assisted reading compared to clinician interpretation | Target: 674 patients (597 needed for power; both prospective and retrospective recruitment) | Sensitivity/specificity of AI to be compared to clinician standard; exact results pending as the study is ongoing |
Eckhoff et al[17], 2023 | Surgical phase recognition for Ivor-Lewis esophagectomy | CNN + LSTM (TEsoNet), transfer learning | Demonstrated feasibility of knowledge transfer from sleeve gastrectomy to esophagectomy with moderate accuracy | 60 sleeve gastrectomy videos, 40 esophagectomy videos (used in combinations across 5 experiments) | Single procedure accuracy: 87.7% (sleeve). Transfer learning: 23.4% overall (4 overlapping phases: 58.6%). Co training max accuracy: 40.8% |
Blum et al[18], 2024 | Prediction of choledocholithiasis | Logistic regression, RF, XGBoost, KNN; ensemble model | Machine learning models can outperform ASGE guidelines in predicting choledocholithiasis risk using pre MRCP data | 222 patients | AUROC: 0.83 (RF), accuracy: 0.81 (ensemble), sensitivity: Up to 0.94, F1 score: Up to 0.82 |
Axon[19], 2020 | Evolution and future of digestive endoscopy | Conceptual AI (future prediction) | Highlights AI’s potential to surpass expert level diagnostic accuracy and support real time treatment decisions | Not applicable | Predicts that AI will revolutionize diagnosis and treatment in endoscopy with real time support tools |
Hsu et al[20], 2023 | Predicting postoperative GIB after bariatric surgery | RF, XGBoost, NNs | Machine learning models outperform logistic regression in predicting GIB, aiding clinical decision making | 159959 patients (632 with GIB) | RF AUROC: 0.764, sensitivity: 75.4%, specificity: 70.0%; XGBoost AUROC: 0.746; NN AUROC: 0.741; logistic regression AUROC: 0.709 |
Athanasiadis et al[21], 2025 | Accuracy of self-assessment in laparoscopic cholecystectomy (CVS quality) | Not reported | Surgeons frequently overestimate CVS performance; self-assessment alone is insufficient | 25 surgeons enrolled, 13 submitted 1 video, 4 submitted 2 videos | No surgeon achieved adequate CVS per expert review; significant discrepancy between self and expert ratings on Strasberg scale |
Bisschops et al[22], 2019 | Advanced endoscopic imaging for colorectal neoplasia | Not reported | Guidance on when to use advanced imaging (HD WLE, CE, NBI, etc.) for detection and differentiation of colorectal lesions | Not applicable | AI suggested for future use if validated; various imaging techniques reviewed for ADR, miss rates, lesion detection |
Han et al[23], 2025 | Intraoperative recognition of PAN during total mesorectal excision | DeepLabv3+ with ResNet50 backbone | AI model (AINS) achieved real time recognition of the PAN, aiding nerve preservation during rectal cancer surgery | 1780 images (1424 training, 356 validation) | Accuracy: 0.9609, precision: 0.7494, recall: 0.6587, F1 score: 0.7011; AI outperformed surgeons (F1: 0.4568) and operated faster (3 minutes vs 25 minutes) |
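Most rows in the tables above report sensitivity, specificity, precision (PPV), NPV, accuracy, and F1 score. As a purely illustrative aside (not drawn from any of the cited studies; the confusion-matrix counts below are hypothetical), these metrics are all derived from the four confusion-matrix counts as follows:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # recall / true positive rate
    specificity = tn / (tn + fp)                  # true negative rate
    precision = tp / (tp + fp)                    # positive predictive value (PPV)
    npv = tn / (tn + fn)                          # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical example: 90 true positives, 20 false positives,
# 80 true negatives, 10 false negatives
m = diagnostic_metrics(tp=90, fp=20, tn=80, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

Note that F1 balances precision against sensitivity, which is why a model such as AINS in Han et al[23] can show high overall accuracy (0.9609) alongside a lower F1 score (0.7011) when positive pixels or lesions are rare.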
Table 3 Summary of studies related to artificial intelligence-based intraoperative guidance in gastrointestinal surgery
Ref. | Surgical context | AI method/model | Guidance function | Data type/source | Real time capability | Performance metrics |
Sato et al[24], 2022 | Thoracoscopic esophagectomy | DeepLabv3+ (CNN based semantic segmentation) | Recurrent laryngeal nerve identification and navigation | 3000 annotated intraoperative images from 20 videos (train/val) + 40 test images from 8 other videos | Yes | Dice: AI 0.58, expert 0.62, general surgeons 0.47 |
Niikura et al[25], 2022 | Upper GI endoscopy | Single shot multibox detector (CNN) | Detection of gastric cancer in endoscopic images | 23892 white light images from 500 patients (1:1 matched to expert endoscopists) | No | Per patient: 100%, per image: 99.87%, IOU: 0.842 |
Yang et al[26], 2022 | EUS for subepithelial lesions | ResNet 50 (CNN based deep learning) | Differentiation of gastrointestinal stromal tumors and leiomyomas using EUS | 10439 EUS images from 752 patients (multicenter); 132 prospective patients for clinical validation | Yes | AUC: 0.986 (internal), 0.642 (external), accuracy: 96.2% |
Schnelldorfer et al[27], 2024 | Staging laparoscopy | YOLOv5 (detection), ensemble ResNet18 CNN (classification) | Identification and classification of peritoneal surface metastases | 4287 lesions from 132 patients (365 biopsied lesions; 3650 image patches) | No | AUC PR: 0.69, AUROC: 0.78, accuracy: 78% |
Guo et al[28], 2021 | GI endoscopy (multi lesion) | Deep CNN (ResNet 50 with TTA and transfer learning) | Detecting four lesion categories to support endoscopic diagnosis | 327121 WLI images for training; 33959 images in validation from 1734 cases | No (0.05 seconds/image) | Sensitivity: 88.3%, specificity: 90.3%, accuracy: 89.7% |
Houwen et al[29], 2022 | Colonoscopy | Not applicable (performance standards) | Defines competence standards (SODA) for AI or endoscopists in optical diagnosis | Simulation studies + systematic literature review; ESGE Delphi consensus panel | Yes | Sensitivity ≥ 90%, specificity ≥ 80% (leave in situ); ≥ 80% both (resect and discard) |
Tatar and Çubukçu[30], 2024 | Colonoscopy | YOLOv8 based CNN (ColoNet) | Identification of neoplastic/premalignant/malignant lesions; biopsy decision support | 1760 colonoscopy images (306 patients) for training/validation; 91 external images for real time testing | Yes | mAP50: 0.832, accuracy: 82.4%, sensitivity: 70.7%, specificity: 92.0% |
Table 4 Artificial intelligence applications for postoperative monitoring and complication prediction in gastrointestinal care
Ref. | Complication targeted | AI method/model | Timing of prediction | Data source/type | Performance metrics | Clinical utility |
van de Sande et al[31], 2022 | Need for hospital specific interventions (e.g., reoperation, radiological intervention, IV antibiotics) | Random forest (4 variants tested) | After second postoperative day | EHR data from 3 non-academic hospitals; 18 perioperative variables (e.g., age, BMI, ASA, meds, surgery time) | AUROC: 0.83, sensitivity: 77.9%, specificity: 79.2%, PPV: 61.6%, NPV: 89.3% | Supports early safe discharge and capacity management |
Choi et al[32], 2024 | Missed small bowel lesions in negative capsule endoscopy | CNN | After initial human reading of SBCE videos | 103 negative SBCE videos retrospectively reanalyzed; images from two academic hospitals | CNN detected additional lesions in 61.2% of cases; model had > 96% accuracy in prior study (AUROC = 0.9957) | Reduces diagnostic oversight; changed diagnosis in 10.3% |
Blum et al[18], 2024 | Choledocholithiasis | Logistic regression, random forest, XGBoost, KNN, ensemble | Before MRCP, using pre intervention data | Retrospective data from 222 patients (clinical, biochemical, imaging variables) from Royal Hobart Hospital | Ensemble & random forest model (accuracy: 0.81, AUROC: 0.83, sensitivity: 0.94, specificity: 0.69, F1: 0.82) | Avoids unnecessary MRCP, triages patients for ERCP |
Haak et al[33], 2022 | Incomplete response after CRT in rectal cancer | CNNs (EfficientNet B2, Xception, etc.); FFN for clinical data; combined model | After chemoradiation, pre surgery | 226 patients; 731 endoscopic images; clinical features from single institute retrospective cohort | EfficientNet B2-AUC: 0.83, accuracy: 0.75, sensitivity: 0.77, specificity: 0.75, PPV: 0.74, NPV: 0.77 | Identifies candidates for non-surgical follow up |
Noar and Khan[34], 2023 | GP | AI derived GMAT threshold using multivariate regression | Pre-treatment (based on GMA from EGG + WLST) | 30 patients with GP; GMA via EGG; WLST; gastric emptying tests; symptom scores (GCSI DD, Leeds) | Sensitivity: 96%, specificity: 75%, accuracy: 93%, AUC: 0.95 for GMAT ≥ 0.59 | Guides selection for balloon dilation; personalized therapy |
Table 5 Artificial intelligence applications in endoscopic diagnosis and quality control for gastrointestinal cancers
Ref. | Cancer type/lesion | Endoscopic modality | AI method/model | Task/objective | Dataset size/source | Performance metrics | Clinical relevance/impact |
Messmann et al[35], 2022 | BERN | Upper GI endoscopy | AI assisted deep learning systems | Real time detection and localization of Barrett’s neoplasia | Meta analysis and real time studies (n > 1000 images) | Sensitivity: 83.7%-95.4%, accuracy: 88%-96% | Improves detection of subtle lesions; supports targeted biopsies over Seattle protocol |
Choi et al[36], 2022 | Not applicable (quality control) | EGD | CNN (squeeze and excitation network) | Classification of anatomical landmarks and completeness of photo documentation | 2599 images from 250 EGD procedures (Korea University Hospital) | Landmark classification: Accuracy 97.58%, sensitivity 97.42%, specificity 99.66%; completeness detection: Accuracy 89.20%, specificity 100% | Enhances quality control in EGD by verifying complete anatomical documentation automatically |
Inaba et al[37], 2024 | Not applicable (bowel preparation) | Colonoscopy (preparation phase) | MobileNetV3 based CNN (smartphone app) | AI based stool image classification to assess bowel preparation quality | 1689 images from 121 patients; 106 patient prospective validation | Accuracy: 90.2% (grade 1), 65.0% (grade 2), 89.3% (grade 3); BBPS ≥ 6 in 99.0% of app users | Improved bowel prep monitoring; 100% cecal intubation; reduced burden on patients and nurses |
Zhang et al[14], 2023 | Suspected choledocholithiasis | Not applicable (pre-endoscopy prediction) | ModelArts AI platform (Huawei); 7 machine learning models also tested | Predictive classification of CBD stones before cholecystectomy | 1199 patients with symptomatic gallstones; retrospective, single center | ModelArts AI: Accuracy 0.97, recall 0.97, precision 0.971, F1 score 0.97 | May outperform guideline based risk stratification; reduces unnecessary ERCP |
Wu et al[38], 2021 | EGC | EGD | ENDOANGEL system (CNNs + deep reinforcement learning) | Real time monitoring of blind spots and detection of EGC | 1050 patients in multicenter RCT; 196 gastric lesions biopsied | Accuracy: 84.7%, sensitivity: 100%, specificity: 84.3% | Reduced blind spots, improved EGD quality, potential for real time EGC detection in clinical setting |
Rondonotti et al[39], 2023 | DRSPs ≤ 5 mm | Colonoscopy with blue light imaging | CAD EYE (Fujifilm, Tokyo, Japan), CNN based real time system | Optical diagnosis to support “resect and discard” strategy | 596 DRSPs in 389 patients, 4 center prospective study (Italy) | NPV: 91.0%, sensitivity: 88.6%, specificity: 88.1%, accuracy: 88.4% | Meets ASGE PIVI thresholds; may enable safe omission of histology in DRSPs, especially beneficial for nonexperts |
Koh et al[40], 2023 | Colonic adenomas including SSA | Colonoscopy | GI Genius™ (CADe system, Medtronic, MN, United States) | Real time detection of colonic polyps and ADR improvement | 298 colonoscopies; 487 AI “hits”; 250 polyps removed | Post AI ADR: 30.4% vs baseline 24.3% (P = 0.02); SSA rate: 5.6% | Enhanced ADR even in experienced endoscopists; improved SSA detection; supports AI use in routine colonoscopy |
Yuan et al[41], 2022 | Gastric lesions (EGC, AGC, SMT, polyp, PU, erosion) | White light endoscopy | YOLO based DCNN model | Multiclass diagnosis of six gastric lesions + lesion free mucosa | 31388 images (29809 train/1579 test) from 9443 patients | Overall accuracy: 85.7%; EGC: Sensitivity 59.2%, specificity 99.3%; AGC: Sensitivity 100%, specificity 98.1% | Comparable to senior endoscopists; improved diagnostic accuracy and efficiency; potential for real time support in diverse gastric lesion detection |
Munir et al[42], 2024 | Not applicable (survey based assessment) | Not applicable | ChatGPT | Evaluation of AI responses to perioperative GI surgery questions | 1080 responses assessed by 45 surgeons | Majority graded “fair” or “good” (57.6%); highest “very good/excellent” rate for cholecystectomy (45.3%) | ChatGPT may aid in patient education, but only 20% deemed it accurate; limited utility in reducing message load |
Sudarevic et al[43], 2023 | Colorectal polyps | Colonoscopy | Poseidon system (EndoMind + waterjet based AI) | AI based in situ measurement of polyp size using waterjet as reference | 28 polyps in silicone model + 29 polyps in routine colonoscopies | Median error: Poseidon 7.4% (model), 7.7% (clinical); visual: 25.1%/22.1%; forceps: 20.0% | Significantly improved sizing accuracy; does not require additional tools; useful for clinical polyp surveillance and resection decisions |
Tsuboi et al[44], 2020 | Small bowel angioectasia | Capsule endoscopy (PillCam SB2/SB3, Medtronic, MN, United States) | CNN (single shot multibox detector) | Automatic detection of angioectasia in CE images | 2237 training images, 10488 validation images (488 angioectasia, 10000 normal) | AUC: 0.998; sensitivity: 98.8%, specificity: 98.4%, PPV: 75.4%, NPV: 99.9% | Enables high accuracy detection of angioectasia; may reduce oversight and physician workload during capsule reading |
Chang et al[45], 2022 | Not applicable (quality control) | Upper GI endoscopy (EGD) | ResNeSt deep learning model | Evaluate photodocumentation completeness via anatomical classification | 15305 training images; 15723 test images from 472 EGD cases | Accuracy: 96.64% (deep learning model); photodocumentation rate: 78% (esophagus to duodenum), 53.8% (pharynx to duodenum) | Enables automated auditing of image completeness; higher completeness linked to higher ADR; applicable for routine EGD quality control |
Hwang et al[46], 2021 | Small bowel hemorrhagic and ulcerative lesions | CE | VGGNet based CNN + Grad CAM | Classification and localization of hemorrhagic vs ulcerative lesions | 30224 abnormal + 30224 normal images (train); 5760 images (validation) | Combined model: Accuracy 96.83%, sensitivity 97.61%, specificity 96.04%, AUROC approximately 0.996 | Enhanced lesion localization without manual annotation; Grad CAM improves interpretability; supports efficient clinical CE analysis |
Jazi et al[47], 2023 | Not applicable (survey + clinical scenarios) | Not applicable | ChatGPT 4 (LLM by OpenAI) | Assess alignment of ChatGPT 4 with expert opinions on bariatric surgery suitability and recommendations | 10 patient scenarios; 30 international bariatric surgeons | Expert match: 30%; ChatGPT 4 inconsistency: 40%; recommended surgery in 60% vs experts 90% | ChatGPT 4 showed limited alignment and inconsistency; suitable for education, but not yet reliable for clinical decision making |
Meinikheim et al[48], 2024 | BERN | Upper GI endoscopy (video based) | DeepLabV3+ with ResNet50 backbone (clinical decision support system) | Evaluate add on effect of AI on endoscopist performance in BERN detection | 96 videos from 72 patients; 51273 images (train); 22 endoscopists from 12 centers | AI alone: Sensitivity 92.2%, specificity 68.9%, accuracy 81.3%; nonexperts with AI: Sensitivity up from 69.8% to 78.0%, specificity up from 67.3% to 72.7% | AI significantly improved nonexperts’ diagnostic performance and confidence; comparable accuracy to experts; highlights human AI interaction dynamics |
Ahmad et al[49], 2021 | Not applicable (research prioritization) | Colonoscopy | Not applicable (Delphi consensus) | Identify top research priorities for AI implementation in colonoscopy | 15 international experts from 9 countries; 3 Delphi rounds | Not performance focused; methodology scores used for consensus | Provides a structured framework to guide future AI implementation research in colonoscopy; emphasizes clinical trial design, data annotation, integration, and regulation |
Lazaridis et al[50], 2021 | Not applicable (survey) | CE | Not applicable (ESGE survey) | Assess adherence to ESGE guidelines and future perspectives on CE use | 217 respondents from 47 countries via ESGE survey | Not model based; survey: 91% performed CE with appropriate indication; 84.1% classified findings as relevant/irrelevant | Highlights variation in guideline adherence; AI identified as top development priority (56.2%); suggests need for standardization and formal CE training |
Tian et al[51], 2024 | Not applicable (anatomical site identification) | EUS | CNN with attention module | Automatic identification of 14 standard BPS anatomical sites on EUS | 6230 training images (1812 patients), internal: 1569 images (47 patients), external: 85322 images (131 patients from 16 centers) | Sensitivity: 89.45%-99.92%, specificity: 93.35%-99.79%, accuracy (internal): 92.1%-100%, kappa: 0.84-0.98 | Outperforms beginners, comparable to experts; enables efficient, high quality anatomical identification in EUS; potential for training and standardization |
He et al[52], 2020 | Upper GI endoscopy (EGD) | CNN models (DenseNet 121, ResNet 50, VGG, etc.) | Automated classification of 11 anatomical sites for quality control and reporting | 3704 images from 211 routine EGD cases (Tianjin Medical University Hospital, Tianjin, China) | (DenseNet 121): Accuracy approximately 91.11%, F1 scores up to 94.92% for specific sites | Supports automated quality assurance in EGD via accurate site classification; aids report generation and completeness verification |
Table 6 Artificial intelligence based and robotics enhanced surgical education and performance assessment
Ref. | Surgical procedure/task | AI method/model | Assessment modality | Data type/source | Performance metrics | Educational outcome |
Garfinkle et al[53], 2022 | Gastrointestinal and endoscopic surgery (priority setting) | Not applicable | Survey/Delphi | Survey data from SAGES member surgeons | Not applicable | Identified core needs: Video training, tech adoption |
Huo et al[54], 2024 | Surgical decision making for GERD | ChatGPT 3.5/4, copilot, Google Bard, Perplexity AI | Prompt based guideline comparison | Standardized clinical vignettes based on SAGES guidelines | Surgeons (accuracy): Bard 6/7 (85.7%), ChatGPT 4 5/7 (71.4%); patients (accuracy): Bard 4/5 (80.0%), ChatGPT 4 3/5 (60.0%); children: Copilot & Bard 3/3 (100.0%) | Revealed inconsistencies in LLM advice; need for medical domain training |
Huo et al[55], 2024 | Surgical management of GERD | Generic ChatGPT 4 vs customized GPT (GTS) | Prompt based guideline comparison | 60 surgeon cases and 40 patient cases based on SAGES and UEG EAES guidelines | GTS (custom GPT): 100% accuracy for both surgeons (60/60) and patients (40/40), generic GPT 4 66.7% (40/60) for surgeons, 47.5% (19/40) for patients | Demonstrated impact of domain customization in LLMs |
Nasir et al[56], 2021 | Robotic rectal cancer surgery | No AI model used (robotics only) | Not reported | RCTs, observational studies, registry data | Reduced conversion to open surgery (especially in obese/male patients); improved urogenital function; no difference in long term oncologic outcomes | Reinforced need for structured robotic training (e.g., EARCS) |
- Citation: Tasci B, Dogan S, Tuncer T. Artificial intelligence in gastrointestinal surgery: A systematic review. World J Gastrointest Surg 2025; 17(8): 109463
- URL: https://www.wjgnet.com/1948-9366/full/v17/i8/109463.htm
- DOI: https://dx.doi.org/10.4240/wjgs.v17.i8.109463