This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Author contributions: Lo B and Burisch J have both made a significant contribution to the research described in this manuscript; all authors approved the final manuscript as well as the authorship list.
Conflict-of-interest statement: Lo B has received a lecture fee from Janssen-Cilag; Burisch J has received consulting fees from Celgene, Janssen-Cilag, AbbVie, Vifor Pharma, Jansen and Ferring; lecture fees from Abbvie, Pfizer, MSD, Pharmacosmos and Takeda Pharma, and unrestricted grant support from Takeda Pharma and Tillotts Pharma.
Corresponding author: Bobby Lo, MD, Doctor, Gastrounit, Medical Section, Copenhagen University Hospital Hvidovre, Kettegård Alle 30, Hvidovre 2650, Denmark. email@example.com
Received: April 28, 2021 Peer-review started: April 28, 2021 First decision: June 13, 2021 Revised: June 27, 2021 Accepted: August 16, 2021 Article in press: August 16, 2021 Published online: August 28, 2021
Assessment of endoscopic disease activity can be difficult in patients with inflammatory bowel disease (IBD), which comprises Crohn's disease (CD) and ulcerative colitis (UC). Endoscopic assessment is currently the foundation of disease evaluation, and the grading is pivotal for the initiation of certain treatments. Yet experts disagree, even when the same expert reassesses the same examination. Some studies have demonstrated that the evaluation is no better than flipping a coin. In UC, the greatest consensus achieved between physicians assessing endoscopic disease activity reached a kappa value of only 0.77 (i.e., agreement adjusted for chance). This is unsatisfactory when dealing with patients at risk of surgery or disease progression without proper care. Lately, computer assistance has attracted increasing interest across all medical specialities, especially after the emergence of machine learning, colloquially referred to as artificial intelligence (AI). Compared to other data analysis methods, the strengths of AI lie in its capability to derive complex models from a relatively small dataset and its ability to learn and optimise its predictions from new inputs. With such a model, one therefore hopes to remove inconsistency among humans and standardise the results across educational levels, nationalities and resources. This has manifested in a handful of studies in which AI is mainly applied to capsule endoscopy in CD and colonoscopy in UC. However, because AI has only recently entered the field of IBD, there is great inconsistency between the results, as well as in how they are reported. In this opinion review, we explore and evaluate the methods and results of the published studies utilising AI within IBD (with examples), and discuss the future possibilities AI can offer within IBD.
Core Tip: Artificial intelligence (AI) is on the rise in inflammatory bowel diseases (IBD). Endoscopic evaluation is so far the most studied modality, with promising results. Studies of other modalities, or of combinations of several modalities, have been carried out with moderate results, leaving room for future research. Data availability and standardisation of the reporting of these new models seem to be the biggest challenges for AI's breakthrough within IBD. International consensus in the field is required to optimise research in AI.
Citation: Lo B, Burisch J. Artificial intelligence assisted assessment of endoscopic disease activity in inflammatory bowel disease. Artif Intell Gastrointest Endosc 2021; 2(4): 95-102
The inflammatory bowel diseases (IBD), which mainly consist of Crohn's disease (CD) and ulcerative colitis (UC), are idiopathic immune-mediated diseases usually affecting young adults[1,2].
Currently, colonoscopy is considered the gold standard in the disease assessment of patients with UC as well as CD located in the terminal ileum and/or colon[3,4]. Disease activity of UC is assessed using scoring systems such as the Mayo Endoscopic Subscore (MES) or UC Endoscopic Index of Severity. Despite being widely used and easy to apply, both indices suffer from moderate to high inter-observer variation, which reduces the credibility of the scores. This has been demonstrated in clinical trials where up to one-third of patients deemed eligible for inclusion based on the MES did not meet the inclusion criteria after reassessment. Even central reading is associated with noteworthy inter-observer variation[7,8].
In CD, the CD Endoscopic Index of Severity and Simple Endoscopic Score for CD are currently the most used indices. Both have demonstrated varying inter-observer agreement, with central reading reducing the inter-observer variation[9-11]. Capsule endoscopy (CE) for evaluating the small bowel can be scored using the Lewis score. Although the score is widely used, the inter-observer agreement for the individual parameters in the index varies widely (kappa 0.37-0.83)[13,14].
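To make the kappa values quoted above more concrete, the following minimal sketch computes an unweighted Cohen's kappa for two readers scoring the same examinations. The function name `cohen_kappa` and the ten scores per reader are invented for illustration; they are not taken from any of the cited studies, and the indices discussed here may in practice be evaluated with weighted kappa or other agreement statistics.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same score by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both raters pick the same category
    # independently, based on each rater's own marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical scores (0-3) from two readers on ten examinations.
reader_1 = [0, 1, 1, 2, 3, 2, 0, 1, 3, 2]
reader_2 = [0, 1, 2, 2, 3, 1, 0, 1, 3, 3]

raw_agreement = sum(a == b for a, b in zip(reader_1, reader_2)) / len(reader_1)
kappa = cohen_kappa(reader_1, reader_2)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

In this invented example the readers agree on 7 of 10 examinations, yet kappa is 0.60 rather than 0.70, because a quarter of the agreement would be expected by chance alone given how often each reader uses each score.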
These inter-observer variations and the risk of misclassification have led to the exploration of artificial intelligence (AI) assisted endoscopic assessment, especially in the field of colon cancer detection[16,17]. AI, depending on which method is used, mimics the human brain by having interconnected neurons that process the information given; however, in contrast to the human brain, AI can theoretically process an unlimited number of variables. In the field of IBD, the use of AI remains limited, although it has received increasing attention. In the following review, we will discuss the use of AI-assisted assessment of endoscopic disease activity among CD and UC patients from a clinical perspective, the challenges the models face, and unexplored areas where AI has the potential to help patients and physicians.
CD can be examined using many modalities. Imaging, and CE in particular, has been an area of interest in terms of AI. A CE camera takes 2-4 frames per second and has an approximate transit time of 250 min, which can result in a total of up to approximately 60000 images. One of the challenges of CE is that reviewing the recording is time-consuming, as a trained person must subsequently review all images. AI has since assisted physicians and endoscopists by filtering out non-informative images, leaving an image series in which the computer believes there is an area of interest. Since the year 2000, AI has been used to identify polyps/tumours, ulcers, celiac disease, hookworms, angioectasia, and bleeding. Among CD patients, special focus has been on small bowel lesions, erosions and ulceration. The majority of recent studies that have examined the listed parameters use a convolutional neural network, a deep learning method that has been shown to be effective in image recognition[18,20]. Overall, these studies have shown an accuracy of > 90%, which must be considered close to perfect. However, the majority of these studies were conducted retrospectively, and prospective results are needed to demonstrate the models' potential in clinical practice.
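The image count quoted for CE follows directly from the frame rate and the transit time. The short sketch below reproduces that arithmetic; the helper name `capsule_frame_count` is hypothetical, and the calculation assumes a constant frame rate over the whole transit, which is a simplification.

```python
def capsule_frame_count(frames_per_second, transit_minutes):
    """Number of frames recorded at a constant frame rate over the transit time."""
    return int(frames_per_second * transit_minutes * 60)


# Frame rates of 2 and 4 per second over an approximate 250 min transit.
low_estimate = capsule_frame_count(2, 250)
high_estimate = capsule_frame_count(4, 250)
print(f"{low_estimate} to {high_estimate} images per examination")
```

Under these assumptions a single examination yields between 30000 and 60000 images, which illustrates why automatic filtering of non-informative frames is attractive.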
Because UC involves only the colon, these patients have been easier to categorise than patients with CD according to the extent and severity of inflammation. Accordingly, most advances regarding AI in IBD have been made in UC, and several clinical tools have been developed to assess the endoscopic disease severity. Such models have achieved an accuracy of 56%–77% in assessing the disease severity according to the MES or UC Endoscopic Index of Severity, which is comparable to that of IBD experts[22-27]. The majority of studies have used methods such as the convolutional neural network to categorize images taken during a colonoscopy or sigmoidoscopy according to the MES. Recently, studies have also investigated the applicability of AI to videos, demonstrating a promising area under the receiver operating characteristic curve (AUROC or AUC) of > 90%[24,26,27].
Currently, the available models are unable to distinguish between the different levels of the MES with sufficient accuracy. However, this is an area under great development and it is expected that within the coming years a model will be able to distinguish between the different MES levels with a satisfactory result and thereby eliminate the inter-observer variance, and standardize the clinical and academic evaluation of the endoscopic disease severity.
A few studies have further examined their model's MES score in relation to histological findings[29,30]. One study used endocytoscopy with a support vector machine and achieved an accuracy of approximately 90% in predicting histological findings, which must be considered an excellent result. Endocytoscopy is, however, not an integral method in most clinics. Furthermore, although the study group utilized both a training and a test set, the training and optimization process of the models is not described, leaving the reader uncertain about, e.g., model selection and tuning. Finally, samples were divided into active inflammation vs remission, which might be too simplified a way of considering both the endoscopic and histological findings. Similar results were demonstrated by Takenaka et al with white-light endoscopy, but with the same challenges. Ultimately, none of these studies validated the results on an independent cohort analyzed by independent experts in order to test the performance of their model in another population or against the judgement of different experts.
POTENTIAL AND DIFFICULTIES
As previously mentioned, AI has been shown to have great potential in the evaluation of endoscopic severity among patients with CD and UC. The models have been shown to perform on a par with or better than physicians in classifying endoscopic disease severity, especially among UC patients. Uniformity in the approach to the endoscopic procedure will make new clinical tools more credible and hopefully lead to less discrepancy between clinical and observational studies. However, it is crucial that new models are developed for clinical purposes so that they can be implemented more quickly, thereby reducing the gap between research and clinical practice.
Besides endoscopic evaluation, disease prediction in IBD has also been investigated using AI models. Waljee et al[32,33] used two clinical trial databases to predict C-reactive protein < 5 mg/L after 42 wk of treatment with ustekinumab and steroid-free remission after 52 wk of treatment with vedolizumab among CD patients, respectively. These studies used a combination of demographic, clinical, and biochemical data in a random forest model to predict patients' course after initiation of treatment. The models achieved an accuracy of 42% and 69%, respectively. Furthermore, the same study group investigated the treatment effect of vedolizumab in UC patients. Their random forest model achieved an accuracy of 58% in predicting corticosteroid-free remission after 52 wk. Grouping UC and CD together, Biasci et al used transcriptomics to identify a blood sample panel of 17 genes with a sensitivity and specificity of approximately 73% for predicting patients' risk of treatment escalation within 1 year. A 5-year prediction study from Choi et al demonstrated a sensitivity and specificity of 71% for predicting the risk of the use of biologics. In contrast to Biasci et al, this study utilized only demographic, clinical and common laboratory markers. Furthermore, Waljee et al[37,38] attempted twice to predict the treatment effect within 1 year, resulting in an AUC of 79% and 87% and an accuracy of 72% and 80%, respectively. A limitation of these studies is that findings are only presented for IBD patients in total and not stratified according to the type of IBD. Despite these efforts, accuracies below 80% must be considered insufficient. Furthermore, even with accuracies above 80%, the results must be viewed alongside the sensitivity, specificity and AUC to achieve an overall picture of the model's performance. Unfortunately, the majority of the studies have reported only some of these measures of validity, of which AUC is the most commonly reported.
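The need to read accuracy together with sensitivity, specificity and AUC can be illustrated with a small sketch. The labels and predicted probabilities below are invented and do not come from any of the cited studies; the helper names `confusion_metrics` and `rank_auc` and the decision threshold of 0.5 are assumptions made for the example. AUC is computed here as the proportion of positive-negative pairs in which the positive case receives the higher score.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # assumes at least one positive case
        "specificity": tn / (tn + fp),  # assumes at least one negative case
    }


def rank_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcome labels and model scores for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.60, 0.45, 0.40, 0.35, 0.38, 0.30, 0.25, 0.20, 0.15, 0.10]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = confusion_metrics(y_true, y_pred)
auc_value = rank_auc(y_true, scores)
print(f"AUC = {auc_value:.2f}, {metrics}")
```

With these invented numbers the AUC is about 0.96 and the specificity is 1.0, while the sensitivity at the chosen threshold is only 0.25 and the accuracy 0.70. A report giving the AUC alone would therefore overstate how the model behaves at its operating point.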
It is not uncommon for patients to undergo a lengthy diagnostic process before a definite diagnosis of CD or UC can be made. This can be a challenge for both physicians and patients, and can result in over- or undertreatment with major consequences for the patient. Recent studies using AI have attempted to use several modalities to better distinguish between these patients: endoscopy, histology, genetic markers, biochemical markers, clinical factors, omics, or a combination of these modalities[40-43]. These have shown acceptable results, with AUC and accuracy of > 80%. It should be emphasized that these studies do not always report all results and that many of the results are from validation data and not necessarily from test data (unseen data), which leaves the reported performance vulnerable to overfitting. Moreover, to our knowledge, none of these models has been applied in clinical practice, and real-life data are warranted to evaluate their efficacy.
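The distinction between validation data and unseen test data can be sketched as a three-way split. The example below is a minimal illustration, not the procedure of any cited study: the function name `split_patients`, the 70/15/15 proportions, the fixed seed and the patient identifiers are all assumptions. Splitting by patient rather than by image is likewise an assumption, chosen here so that images from one patient cannot appear in more than one set.

```python
import random


def split_patients(patient_ids, train_pct=70, val_pct=15, seed=0):
    """Split unique patient IDs into disjoint training, validation and test sets."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder is held out and never used for tuning
    return train, val, test


# Hypothetical cohort of 100 patients.
patients = [f"P{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_patients(patients)
print(len(train_ids), len(val_ids), len(test_ids))
```

In such a set-up, model selection and hyperparameter tuning would use only the training and validation sets, and the test set would be evaluated once at the end, so that the reported performance reflects unseen data.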
To our knowledge, no studies exploring other modalities in connection with AI have been published to date. In particular, the complexity of CD results in several challenges when developing new AI models. One area that remains untouched is the use of AI during colonoscopy in CD patients. This could be due to challenges in the endoscopic disease assessment of CD, as the disease can be patchy and the severity varies between patches. In addition, indices for CD are difficult or time-consuming to use in clinical practice. This could be addressed by developing new scoring indices based on an evaluation from an AI model, allowing the gut to be assessed as a whole rather than by the segmented method currently in use.
In addition to endoscopies for both UC and CD, modalities such as ultrasound, magnetic resonance imaging, colon CE and computed tomography are obvious opportunities for the development of new clinical tools.
Unfortunately, this field is also challenged by several issues. First and foremost, a paradigm shift is needed, from assessment by a medical professional to computer-aided assessment. This will require doctors to accept the new technology, which can be difficult to understand, as the latest AI architectures use deep learning, in which the process between input and output is a black box. As it is not possible to account fully for what happens in this black box, mistrust of the models might arise among clinicians. Despite different ways of explaining the black box, mathematically and illustratively, it is only possible to give an estimate of its process.
Secondly, medical education may need to be reorganized in the future to focus more on interpretation and critical evaluation of the results of these new models. The medical field has experienced a similar paradigm shift before, with the introduction of the World Wide Web. This gave patients access to the same knowledge as physicians, and doctors went from being the ultimate source of truth to having to explain how the symptoms and the disease are connected and which diagnoses and disease courses are most likely. However, a reorganization of medical education in connection with AI may require interdisciplinary involvement of, among others, bioinformaticians and computer scientists to better equip doctors to interpret and critically evaluate the models' output.
Thirdly, developing these new models requires larger amounts of data than the field has previously been accustomed to. However, the amount of data needed varies significantly with the outcome and the methods used, and no specific required number exists. As data collection is resource-demanding, the estimate must be adjusted to what is clinically possible. In recent years, cross-border collaborations have been formed to make large amounts of data available. However, these data are rarely freely available, and their quality must also be critically evaluated, as the workflow and equipment vary markedly between nations. We therefore encourage everyone to make their data at least partially accessible; a good example is the HyperKvasir dataset.
Finally, international reporting standards must be set for AI studies within the field of IBD. AI is still relatively unexplored territory within IBD. This has led to great variation in the way studies report both their methods and results, despite several calls for uniformity. A good example is the endoscopic evaluation of disease severity in UC patients. Often, only AUC is reported, which can be misleading, as sensitivity, specificity and accuracy may be only modest. This is because the AUC is a measure of how well the true positives can be separated from the rest, while measures such as accuracy reflect the actual performance of the model. Even when studies report the desired parameters, the reporting method can vary; for example, some calculate the sensitivity, specificity and accuracy for each class rather than reporting the overall sensitivity, specificity and accuracy for the entire index. We therefore recommend that future articles report, as a minimum, the information and parameters described in Table 1.
Table 1 Recommendations for reporting of studies regarding artificial intelligence.
Origin of dataset and description of the acquisition process
Definition of ground truth
Split of the data set, which should include a training, a validation and a test set, with a clear statement that the test set is not used to tune hyperparameters or to select the model
Method and architecture used, whether it is pretrained or not, and what dataset it is pretrained on
Full technical detail should be included in supplementary files
Statement of post-selection analyses and why these are conducted
A complete report of all results including but not restricted to AUC, sensitivity, specificity, accuracy and kappa value for the overall model's performance and not for selected tasks
Risks of overfitting and bias
Generalisability and cautions to take
AUC: Area under the receiver operating characteristic curve.
In addition, international journals should set standards for what is required of future AI studies within the field. The use of previous reporting methods, e.g., STARD guidelines, seems outdated and should be updated to the new technological reality.
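The difference between per-class and overall reporting mentioned above can be shown with a small worked example. The 4 x 4 confusion matrix below is entirely invented (rows are the reference MES 0-3, columns the model's prediction), and the helper names are hypothetical. Per-class accuracy is computed one-vs-rest, which is one common way of reporting it and is assumed here rather than taken from any cited study.

```python
def overall_accuracy(matrix):
    """Fraction of all cases placed in the correct class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


def one_vs_rest_accuracy(matrix, k):
    """Accuracy for class k when every other class is merged into 'rest'."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # class k cases predicted as another class
    fp = sum(row[k] for row in matrix) - tp  # other classes predicted as class k
    tn = total - tp - fn - fp
    return (tp + tn) / total


# Hypothetical confusion matrix: rows = reference MES 0-3, columns = predicted MES 0-3.
confusion = [
    [20, 5, 0, 0],
    [6, 12, 7, 0],
    [0, 6, 10, 4],
    [0, 0, 5, 15],
]

overall = overall_accuracy(confusion)
per_class = [one_vs_rest_accuracy(confusion, k) for k in range(4)]
print(f"overall = {overall:.2f}, per class = {[round(a, 2) for a in per_class]}")
```

In this invented example the overall accuracy is 0.63, whereas the four per-class accuracies range from 0.73 to 0.90. The same model therefore looks considerably better when only per-class figures are reported, which is why a common reporting convention matters.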
AI is on the rise in IBD. Endoscopic evaluation is so far the most studied modality, with promising results. Studies of other modalities, or of combinations of several modalities, have been carried out with moderate results, leaving room for future research. Data availability and standardization of the reporting of these new models seem to be the biggest challenges for AI's breakthrough within IBD. International consensus in the field is required to optimize research in AI.
Magro F, Gionchetti P, Eliakim R, Ardizzone S, Armuzzi A, Barreiro-de Acosta M, Burisch J, Gecse KB, Hart AL, Hindryckx P, Langner C, Limdi JK, Pellino G, Zagórowicz E, Raine T, Harbord M, Rieder F; European Crohn's and Colitis Organisation [ECCO]. Third European Evidence-based Consensus on Diagnosis and Management of Ulcerative Colitis. Part 1: Definitions, Diagnosis, Extra-intestinal Manifestations, Pregnancy, Cancer Surveillance, Surgery, and Ileo-anal Pouch Disorders. J Crohns Colitis. 2017;11:649-670.
Peyrin-Biroulet L, Sandborn W, Sands BE, Reinisch W, Bemelman W, Bryant RV, D'Haens G, Dotan I, Dubinsky M, Feagan B, Fiorino G, Gearry R, Krishnareddy S, Lakatos PL, Loftus EV Jr, Marteau P, Munkholm P, Murdoch TB, Ordás I, Panaccione R, Riddell RH, Ruel J, Rubin DT, Samaan M, Siegel CA, Silverberg MS, Stoker J, Schreiber S, Travis S, Van Assche G, Danese S, Panes J, Bouguen G, O'Donnell S, Pariente B, Winer S, Hanauer S, Colombel JF. Selecting Therapeutic Targets in Inflammatory Bowel Disease (STRIDE): Determining Therapeutic Goals for Treat-to-Target. Am J Gastroenterol. 2015;110:1324-1338.
Feagan BG, Sandborn WJ, D'Haens G, Pola S, McDonald JWD, Rutgeerts P, Munkholm P, Mittmann U, King D, Wong CJ, Zou G, Donner A, Shackelton LM, Gilgen D, Nelson S, Vandervoort MK, Fahmy M, Loftus EV Jr, Panaccione R, Travis SP, Van Assche GA, Vermeire S, Levesque BG. The role of centralized reading of endoscopy in a randomized controlled trial of mesalamine for ulcerative colitis. Gastroenterology. 2013;145:149-157.e2.
Daperno M, Comberlato M, Bossa F, Armuzzi A, Biancone L, Bonanomi AG, Cassinotti A, Cosintino R, Lombardi G, Mangiarotti R, Papa A, Pica R, Grassano L, Pagana G, D'Incà R, Orlando A, Rizzello F; IGIBDEndo Group. Training Programs on Endoscopic Scoring Systems for Inflammatory Bowel Disease Lead to a Significant Increase in Interobserver Agreement Among Community Gastroenterologists. J Crohns Colitis. 2017;11:556-561.
Esaki M, Washio E, Morishita T, Sakamoto K, Fuyuno Y, Hirano A, Umeno J, Kitazono T, Matsumoto T, Suzuki Y. Su1262 Inter- and intra-observer variation of capsule endoscopic findings for the diagnosis of Crohn's disease: A case-control study. Gastrointest Endosc. 2018;87:AB302.
Wang Y, He X, Nie H, Zhou J, Cao P, Ou C. Application of artificial intelligence to the diagnosis and therapy of colorectal cancer. Am J Cancer Res. 2020;10:3575-3598.
Alammari A, Islam ABMR, Oh JH, Tavanapong W, Wong J, De Groen PC. Classification of ulcerative colitis severity in colonoscopy videos using CNN. ICIME 2017: Proceedings of the 9th International Conference on Information Management and Engineering; 2019 Oct 9-11; Barcelona, Spain. Barcelona Spain, 2017: 139-44.
Lo B, Liu Z, Bendtsen F, Igel C, Vind I, Burisch J. Deep Learning Surpasses Gastrointestinal Experts at Classifying Endoscopic Severity in Patients with Ulcerative Colitis. 2021 Preprint. Available from: SSRN Electron Journal.
Stephen Hawking: Artificial intelligence could be the greatest disaster in human history. [cited 2021 Apr 11]. Available from: https://www.independent.co.uk/news/people/stephen-hawking-artificial-intelligence-diaster-human-history-leverhulme-centre-cambridge-a7371106.html.
Buhrmester V, Münch D, Arens M. Analysis of Explainers of Black Box Deep Neural Networks for Computer Vision: A Survey. 2019 Preprint. Available from: arXiv:1911.12116.