Con D, van Langenberg DR, Vasudevan A. Deep learning vs conventional learning algorithms for clinical prediction in Crohn's disease: A proof-of-concept study. World J Gastroenterol 2021; 27(38): 6476-6488 [DOI: 10.3748/wjg.v27.i38.6476]
Corresponding Author of This Article
Danny Con, MD, Doctor, Statistician, Department of Gastroenterology and Hepatology, Eastern Health, 8 Arnold Street, Box Hill 3128, Victoria, Australia. email@example.com
This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Author contributions: Con D contributed conceptualization, data collection, statistical analysis, data interpretation, manuscript drafting; van Langenberg DR contributed conceptualization, data interpretation, reviewing of manuscript critically for important intellectual content; Vasudevan A contributed conceptualization, data collection, data interpretation, reviewing of manuscript critically for important intellectual content; all authors approved the final version of the manuscript.
Institutional review board statement: This study was reviewed and approved by the Eastern Health Office of Research & Ethics (approval number: LR 61/2015).
Informed consent statement: Patients were not required to give informed consent to the study because the analysis used anonymous clinical data that were obtained retrospectively.
Conflict-of-interest statement: Con D has no relevant conflicts of interest to declare. AV has received financial support to attend educational meetings from Ferring. van Langenberg DR has served as a speaker and/or received travel support from Takeda, Ferring and Shire. He has consultancy agreements with Abbvie, Janssen and Pfizer. He received research funding grants for investigator-driven studies from Ferring, Shire and AbbVie.
Data sharing statement: No additional data are available.
STROBE statement: The authors have read the STROBE Statement-checklist of items, and the manuscript was prepared and revised according to the STROBE Statement-checklist of items.
Received: March 5, 2021 Peer-review started: March 5, 2021 First decision: April 17, 2021 Revised: April 26, 2021 Accepted: September 6, 2021 Article in press: September 6, 2021 Published online: October 14, 2021
Traditional methods of developing predictive models in inflammatory bowel diseases (IBD) rely on statistical regression approaches to derive clinical scores such as the Crohn's disease (CD) activity index. However, traditional approaches are unable to take advantage of more complex data structures such as repeated measurements. Deep learning methods have the potential ability to automatically find and learn complex, hidden relationships between predictive markers and outcomes, but their application to clinical prediction in CD and IBD has not been explored previously.
To determine and compare the utility of deep learning with conventional algorithms in predicting response to anti-tumor necrosis factor (anti-TNF) therapy in CD.
This was a retrospective single-center cohort study of all CD patients who commenced anti-TNF therapy (either adalimumab or infliximab) from January 1, 2010 to December 31, 2015. Remission was defined as a C-reactive protein (CRP) < 5 mg/L at 12 mo after anti-TNF commencement. Three supervised learning algorithms were compared: (1) A conventional statistical learning algorithm using multivariable logistic regression on baseline data only; (2) A deep learning algorithm using a feed-forward artificial neural network on baseline data only; and (3) A deep learning algorithm using a recurrent neural network on repeated data. Predictive performance was assessed using area under the receiver operator characteristic curve (AUC) after 10× repeated 5-fold cross-validation.
A total of 146 patients were included (median age 36 years, 48% male). Concomitant therapy at anti-TNF commencement included thiopurines (68%), methotrexate (18%), corticosteroids (44%) and aminosalicylates (33%). After 12 mo, 64% had CRP < 5 mg/L. The conventional learning algorithm selected the following baseline variables for the predictive model: Complex disease behavior, albumin, monocytes, lymphocytes, mean corpuscular hemoglobin concentration and gamma-glutamyl transferase, and had a cross-validated AUC of 0.659, 95% confidence interval (CI): 0.562-0.756. A feed-forward artificial neural network using only baseline data demonstrated an AUC of 0.710 (95%CI: 0.622-0.799; P = 0.25 vs conventional). A recurrent neural network using repeated biomarker measurements demonstrated significantly higher AUC compared to the conventional algorithm (0.754, 95%CI: 0.674-0.834; P = 0.036).
Deep learning methods are feasible and have the potential for stronger predictive performance compared to conventional model building methods when applied to predicting remission after anti-TNF therapy in CD.
Core Tip: Deep learning has vast potential, but its clinical utility in predicting outcomes in Crohn’s disease (CD) has not been explored. This study showed that deep learning algorithms (a recurrent neural network) using a more complex information structure including repeated biomarker measurements had a better predictive performance compared to a conventional statistical algorithm using only baseline data. This proof-of-concept study therefore paves the way for further research in the use of deep learning methods in clinical prediction in CD.
Crohn's disease (CD) is a heterogeneous chronic inflammatory bowel disease (IBD) that is characterized by intermittent flares, medication changes, the potential need for surgery and substantial psychological morbidity[1,2]. As with many chronic conditions, predicting disease trajectory, outcomes and response to therapies in CD is a key component of clinical practice where management is tailored to the individual. Precision medicine has been in part driven by the vast expansion of available electronic health data, genomic data and novel disease biomarkers. However, deciphering the complex relationships between large amounts of information and multiple data types presents new analytical challenges.
Traditional approaches to constructing prediction models rely on multivariable regression approaches, typically logistic regression for classification or proportional hazards regression for longitudinal prediction. The resulting predictive models are thus typically only linear combinations of the included predictors and may have limited ability to learn more complex relationships within the data. The advantage of machine learning and artificial intelligence over traditional predictive tools is the potential ability for computational algorithms to automatically find and learn complex, hidden relationships between predictive markers and outcomes[5,6]. This is especially true for deep learning or artificial neural network (ANN) methods, although their 'black box' approach has been criticized for an inability to produce a causal explanation between predictors and outcomes.
Despite some limitations, there is much interest in developing and testing machine learning and deep learning tools to aid decision making[5,7]. In luminal gastroenterology, machine learning is gaining traction but its use has been relatively limited to automatic image recognition in endoscopy[8-11] as well as feature selection in genomic and microbiomics data[12,13]. Although there has been great interest in predicting clinical outcomes in CD such as response to therapeutics including biologics[14-18] and immunomodulators[19,20], studies investigating the utility of machine learning models for such predictive tasks have been more limited[21-23]. In particular, the utility of deep learning or ANNs specifically in clinical prediction of CD remains unknown.
We aimed to evaluate the utility of deep learning algorithms compared with conventional statistical learning algorithms for clinical prediction in this proof-of-concept study. In particular, we aimed to compare these algorithms as methods of learning and prediction in a general sense, rather than to develop any specific predictive model or score.
MATERIALS AND METHODS
This proof-of-concept study utilized a retrospective longitudinal cohort at a tertiary health network comprising three acute hospitals in Melbourne, Australia. The focus of the study was to compare the ability of two supervised learning algorithms (conventional statistical learning vs deep learning) to predict remission after 12 mo of treatment using clinical variables and biomarkers available at baseline. The performance of each algorithm was evaluated using cross-validation. The emphasis of the study was to compare the predictive performance of the two methods of learning rather than any specific model itself. This study was approved by the Eastern Health Office of Research & Ethics (approval number: LR 61/2015).
All adult patients > 18 years with confirmed CD according to standard criteria were included if they were commenced on treatment with an anti-tumor necrosis factor (anti-TNF) agent (adalimumab or infliximab) for luminal CD and received at least one dose of the drug between January 2010 and December 2015. Patients receiving anti-TNF for perianal disease without luminal disease were excluded. Patients were followed up for 12 mo to determine rates of biochemical remission.
Response to anti-TNF was defined as having achieved biochemical remission as per serum C-reactive protein (CRP) < 5 mg/L at 12 mo. This endpoint was chosen because CRP is an accepted biomarker to reflect disease activity and predict outcomes in CD[25,26]. Additionally, normalization of CRP predicts better outcomes in CD patients in remission[27,28]. The first CRP measurement after 12 mo and before 18 mo was used. Patients who did not have a CRP measurement in this time period were excluded.
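The endpoint definition above can be sketched as a small function. This is only an illustration: the function name, the input layout and the day counts used for "12 mo" and "18 mo" are assumptions, not details given in the article.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

CRP_REMISSION_THRESHOLD = 5.0  # mg/L, the cut-off stated in the study


def remission_at_12_months(
    start: date, crp_series: List[Tuple[date, float]]
) -> Optional[bool]:
    """Return True/False for biochemical remission, or None if excluded.

    `start` is the anti-TNF commencement date and `crp_series` holds
    (measurement date, CRP in mg/L) pairs. Both names are hypothetical.
    """
    window_open = start + timedelta(days=365)   # assumed length of 12 mo
    window_close = start + timedelta(days=548)  # assumed length of 18 mo
    in_window = sorted(
        (d, v) for d, v in crp_series if window_open <= d < window_close
    )
    if not in_window:
        return None  # no CRP in the window: patient is excluded
    _, first_value = in_window[0]  # earliest measurement in the window
    return first_value < CRP_REMISSION_THRESHOLD
```

Only the first measurement in the window decides the outcome, so a later raised CRP inside the same window does not change the label.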
Data collection and pre-processing
Baseline characteristics were collected via hospital and clinic records, including Montreal classification, concomitant baseline therapies, prior anti-TNF exposure and prior surgeries. Biomarker data were collected at two time points: (1) A baseline measurement defined as the most proximate measurement prior to commencing anti-TNF, up to 3 mo before commencement; and (2) A prior measurement defined as the second most proximate measurement, up to 12 mo before commencement. Only patients with complete baseline data were included, while missing prior values were imputed with the respective baseline value. The following variables were log-transformed to correct skewness: serum bilirubin, alanine aminotransferase, alkaline phosphatase and gamma-glutamyl transferase (GGT). The data underlying this article cannot be shared publicly due to privacy and ethical concerns. The data will be shared upon reasonable request to the corresponding author.
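A minimal sketch of the two pre-processing steps described above (imputing a missing prior value with the baseline value, then log-transforming the skewed liver tests) might look as follows. The dictionary layout and the variable keys are assumptions made for illustration; the natural logarithm is used because Table 2 reports these variables per log_e unit.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical keys for the four variables the study log-transformed
LOG_TRANSFORMED = {"bilirubin", "alt", "alp", "ggt"}


def preprocess(
    baseline: Dict[str, float], prior: Dict[str, Optional[float]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (baseline, prior) after imputation and log-transformation."""
    # A missing prior measurement takes the value of the baseline measurement
    filled_prior = {
        name: (prior[name] if prior.get(name) is not None else value)
        for name, value in baseline.items()
    }

    def transform(values: Dict[str, float]) -> Dict[str, float]:
        # Natural log for the skewed variables, identity for the rest
        return {
            name: (math.log(v) if name in LOG_TRANSFORMED else v)
            for name, v in values.items()
        }

    return transform(baseline), transform(filled_prior)
```

Imputing before transforming means an imputed prior value is identical to the transformed baseline value, so the model sees no change over time for that biomarker.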
The conventional approach to developing a predictive clinical model is to run univariable and multivariable regression analysis to find useful and preferably independent predictors of the outcome of interest (see Figure 1). Criteria for variable selection usually involve significance testing (P values) or likelihood-based information criteria (such as the Akaike information criterion). In this study, logistic regression was used given the dichotomous nature of the outcome (CRP < 5 mg/L vs CRP ≥ 5 mg/L). The conventional approach typically only uses data from a single time-point; we therefore used baseline data only (the most proximate measurement for all biomarkers). For this conventional approach, we employed the following modelling algorithm: (1) Perform univariable logistic regression on each variable and retain all variables with P < 0.5; (2) Run backwards stepwise selection on all retained variables with removal criterion P > 0.2; and (3) Use the regression coefficients in the remaining multivariable model to derive the predictive score.
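Steps (1) and (2) of this algorithm can be sketched in Python as below. The study itself used Stata, so this is not the authors' code. The callable `fit_p_values` is a hypothetical stand-in for "fit a logistic regression on these variables and return one P value per variable"; any regression library could supply it.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: variables in, {variable: P value} out
PValueFn = Callable[[Sequence[str]], Dict[str, float]]


def select_variables(
    candidates: Sequence[str],
    fit_p_values: PValueFn,
    screen_p: float = 0.5,   # univariable retention threshold from the study
    removal_p: float = 0.2,  # stepwise removal threshold from the study
) -> List[str]:
    """Univariable screening followed by backward stepwise elimination."""
    # Step 1: keep variables whose univariable P value is below screen_p
    retained = [v for v in candidates if fit_p_values([v])[v] < screen_p]

    # Step 2: refit, drop the least significant variable, repeat
    while retained:
        p_values = fit_p_values(retained)
        worst = max(retained, key=lambda v: p_values[v])
        if p_values[worst] <= removal_p:
            break  # every remaining variable meets the retention criterion
        retained.remove(worst)
    return retained
```

Step (3) then reads the coefficients off the final multivariable fit. Because the selection depends on the data, the whole procedure (not just the final model) has to be repeated inside each cross-validation fold.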
Figure 1 Comparison of the predictive modelling process using two supervised learning algorithms.
A: Conventional statistical learning; B: Deep learning.
Deep learning algorithms (experimental approach)
A basic deep learning algorithm is a feed-forward ANN. An ANN is composed of layers: an input layer (consisting of all the input predictor variables), an output layer (the prediction), and a number of 'hidden' layers (see Figure 1). Nodes within a hidden layer are called 'neurons'. The hidden layers allow an ANN to learn complex, non-linear relationships between input variables and the outcome of interest. The influence of nodes in a layer on other nodes in subsequent layers is ‘trained’ or fitted using a mathematical function and ultimately determines how information is propagated through the ANN — this is analogous to fitting a regression line on data in conventional statistics. An ANN with only an input and output layer, without hidden layers, can be analogous to simple logistic regression, although they are not equivalent.
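As a rough formalization of the layered structure just described, a feed-forward ANN with L hidden layers can be written as follows. The activation function σ and the symbols W, b, w and c are notation introduced here for illustration; the article does not specify them.

```latex
\begin{aligned}
h^{(0)} &= x \quad \text{(input predictor variables)} \\
h^{(\ell)} &= \sigma\!\left(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\right), \qquad \ell = 1, \dots, L \\
\hat{y} &= \frac{1}{1 + \exp\!\left(-\left(w^{\top} h^{(L)} + c\right)\right)}
\end{aligned}
```

With L = 0 the output reduces to a logistic function of a linear combination of the inputs, which is the sense in which a network without hidden layers resembles logistic regression.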
However, like the conventional statistical algorithm, a basic feed-forward ANN is still only able to model relationships between predictors at a single time-point. A recurrent neural network (RNN) is a more advanced deep learning algorithm that is able to model repeated measurements over time. Like a feed-forward ANN, information is propagated from the input layer to the output layer. However, instead of only allowing the information to pass through once, information is fed to the RNN sequentially, or 'recurrently': each set of repeated measurements is input one at a time, allowing the RNN to update its knowledge of the relationship between the predictors and the outcome. Therefore, the algorithm is additionally able to learn and utilize the dynamics of biomarkers over time, in a way that cannot be achieved by conventional statistical learning methods.
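The sequential update can be illustrated with the simplest recurrent cell. This is a sketch under assumptions: the article does not state which cell type was used, and the tanh activation and all symbols below are introduced here for illustration. With two time points, t = 1 is the prior measurement and t = 2 is the baseline measurement.

```latex
\begin{aligned}
h_0 &= 0 \\
h_t &= \tanh\!\left(W_x x_t + W_h h_{t-1} + b\right), \qquad t = 1, 2 \\
\hat{y} &= \frac{1}{1 + \exp\!\left(-\left(v^{\top} h_2 + c\right)\right)}
\end{aligned}
```

The hidden state h_1 carries information from the prior measurement into the step that processes the baseline measurement, so the prediction can depend on how a biomarker changed between the two time points.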
We tested the feed-forward ANN and the RNN in three separate experiments: (1) Using all baseline clinical data in a feed-forward ANN; (2) Using only baseline biomarker data in a feed-forward ANN; and (3) Using repeated biomarker data in an RNN. In this study after hyper-parameter tuning, we used a feed-forward ANN architecture of 3 hidden layers, each with 64 neurons, and an RNN architecture of 1 hidden layer with 64 neurons.
Comparison of algorithms
The predictive performance of the conventional statistical algorithm and the experimental deep learning algorithms (ANN) was defined as their ability to correctly classify 12-mo CRP < 5 mg/L, measured using the area under the receiver operator characteristic curve (AUC). Because the learning ability of an ANN can be arbitrarily increased, an overly powerful ANN that is trained to near-perfect prediction on the original training cohort would suffer from poor predictive ability in an external cohort (this is called ‘over-fitting’, a well-known phenomenon). Similarly, the same conventional statistical learning algorithm might result in models with different variables when applied to different cohorts. Therefore, it is important to evaluate the ability of a learning algorithm to predict outcomes in patients that are not included in the original training cohort (external validity).
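For readers less familiar with the metric, the AUC can be computed directly from its rank-based definition: the proportion of (remission, non-remission) patient pairs in which the patient who achieved remission received the higher predicted score, with ties counted as half. The sketch below is illustrative only and is not the authors' code.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` uses 1 for remission and 0 for no remission.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one patient in each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The quadratic loop is adequate for a cohort of this size.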
In the absence of external testing cohorts to assess external validity, cross-validation is an internal validation procedure that is suited to this purpose. During cross-validation, the cohort is randomly divided into k equally sized sub-cohorts, known as ‘folds’ (where k is often 5 or 10 by convention). Then, one fold is set aside to be used to test the algorithm, after the algorithm is first trained on the remaining k-1 folds (see Figure 2). This allows the algorithms to be tested on patients that were not used during training. The process is then repeated for each fold (where each fold takes turns in being the test fold). The average AUC after repeating k times gives the cross-validated AUC. However, this procedure is not free from error, because the partitioning process may have randomly resulted in a better (or worse) than usual performance. Thus it is important to repeat the whole process a number of times, to reduce this error.
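The partitioning procedure described above can be sketched as a generator of train/test index lists. The function name and the fixed seed are assumptions for illustration, and the folds here are not stratified by outcome because the article does not say whether stratification was used.

```python
import random
from typing import Iterator, List, Tuple


def repeated_k_fold(
    n_patients: int, k: int = 5, repeats: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)  # seeded so the partitions are reproducible
    for _ in range(repeats):
        order = list(range(n_patients))
        rng.shuffle(order)  # a fresh random partition for each repeat
        folds = [order[i::k] for i in range(k)]  # k near-equal folds
        for test_fold in folds:
            held_out = set(test_fold)
            train = [i for i in order if i not in held_out]
            yield train, test_fold  # each fold is the test set exactly once
```

With 146 patients, k = 5 and 10 repeats, this yields 50 train/test splits. The algorithm is trained and scored once per split, and the 50 AUC values are then averaged.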
Figure 2 Schematic diagram of k-fold cross validation procedure for k = 5.
This method is considered more reliable than a random train-test split, which would be analogous to training only one model, instead of the average of k models. AUC: Area under the receiver operator characteristic curve.
For this study, we used 5-fold cross-validation repeated 10 times to estimate the generalizability of each algorithm on unseen data. Statistical comparison of the cross-validated AUCs of each learning algorithm was made using the variance-corrected repeated k-fold t test instead of a conventional paired t test, because repeated partitioning of the same dataset violates the independence assumption. For comparison, the naïve or apparent AUC of each model after training and testing on the same entire cohort is also given, although this is non-informative about performance on unseen data. Sample size calculations were conducted only as a guide given the exploratory nature of the study and without prior similar studies on which to base AUC assumptions. The target sample size to detect a 10% difference in AUC with 80% power at a 5% significance level, assuming an AUC variance of 10%, was n = 157. To instead detect a 15% difference in AUC under the same conditions, a sample size of n = 70 was required. The Python 3.8.4 programming language with the open-source module PyTorch was used to create the deep learning algorithm. Stata/IC 16 (Texas, United States, 2020) was used to create the statistical learning algorithm.
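The article does not give the formula for the variance-corrected test. One commonly used form, assumed here, inflates the variance of the per-fold AUC differences by the test-to-train size ratio to account for overlapping training sets. The sketch below returns the t statistic and its degrees of freedom; the P value would then come from a t distribution, which is left to a statistics package.

```python
import math
import statistics
from typing import Sequence, Tuple


def corrected_repeated_kfold_t(
    differences: Sequence[float], n_train: int, n_test: int
) -> Tuple[float, int]:
    """Variance-corrected t statistic for repeated k-fold comparisons.

    `differences` holds one AUC difference (algorithm A minus B) per
    fold and repeat, so its length is k x repeats.
    """
    m = len(differences)
    mean_diff = statistics.mean(differences)
    variance = statistics.variance(differences)  # sample variance (m - 1)
    # The n_test / n_train term is the correction; a naive paired t test
    # would use 1 / m alone and understate the standard error.
    corrected_se = math.sqrt((1.0 / m + n_test / n_train) * variance)
    return mean_diff / corrected_se, m - 1
```

For 5-fold cross-validation the ratio n_test / n_train is about 1/4, so with 50 differences the corrected standard error is several times larger than the naive one, making the test more conservative.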
A total of 146 CD patients were included (see Table 1). Their median age was 36 years [inter-quartile range (IQR) 25-50], 48% were male and median disease duration since diagnosis was 5 years (IQR 1-12). The anti-TNF commenced was infliximab in 58% and adalimumab in 42%. Concomitant therapy at anti-TNF commencement included thiopurines (68%), methotrexate (18%), corticosteroids (44%) and aminosalicylates (33%). Over a quarter of patients (28%) had prior intestinal surgery, while 15% had prior exposure to anti-TNF. After 12 mo, 94 (64%) patients were in biochemical remission (CRP < 5 mg/L).
Table 1 Baseline characteristics of study cohort (n = 146).
Univariable analysis: Baseline factors associated with biochemical remission at 12 mo on univariable testing included non-complex disease behavior (B1), higher albumin and mean corpuscular hemoglobin concentration (MCHC), and lower platelets, lymphocytes and monocytes (each P < 0.05; see Table 2), while lower neutrophil count was nearly significant (P = 0.06). There was no significant association with age, sex, disease location or baseline medical therapies (see Table 2).
Table 2 Estimated odds ratios with 95% confidence intervals on univariable and multivariable logistic regression analysis.
Adj. OR (95%CI)
Age, per year
Male (vs female)
Ileal location (L1)
Complex disease (B2/B3)
Anti-TNF type: infliximab (vs adalimumab)
Prior intestinal surgery
Disease duration, per log_e year
Albumin, per g/L
Hemoglobin, per g/L
HCT, per %
RCC, per 10^9/L
MCV, per fL
MCH, per pg/cell
MCHC, per mg/L
Platelets, per 100 × 10^9/L
Neutrophils, per 10^9/L
Lymphocytes, per 10^9/L
Monocytes, per 10^9/L
Eosinophils, per 10^9/L
Basophils, per 0.01 × 10^9/L
Bilirubin, per log_e µmol/L
ALT, per log_e IU/L
ALP, per log_e IU/L
GGT, per log_e IU/L
Variables excluded after univariable regression are in grey; variables excluded after stepwise selection are marked with a dash. CI: Confidence interval; OR: Odds ratio; CD: Crohn's disease; TNF: tumor necrosis factor; HCT: Hematocrit; RCC: Red cell count; MCV: Mean corpuscular volume; MCH: Mean corpuscular hemoglobin; MCHC: Mean corpuscular hemoglobin concentration; ALP: Alkaline phosphatase; ALT: Alanine aminotransferase; GGT: Gamma-glutamyl transferase.
Multivariable analysis: After backward stepwise selection, the following variables remained in the final multivariable model: Complex disease, baseline albumin, monocytes, lymphocytes, MCHC and GGT (see Table 2). The resulting prediction model was given by the following equation (coefficients correct to two significant figures): Score = 0.079 × (albumin, g/L) + 0.050 × (MCHC, mg/L) - 1.1 × (monocytes, 10^9/L) - 0.43 × (lymphocytes, 10^9/L) - 1.0 × (complex disease, yes = 1, no = 0) - 0.69 × log_e(GGT, IU/L).
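The published equation translates directly into a small function. The function and parameter names are hypothetical. The article reports no intercept or cut-off, so the score can rank patients relative to one another but cannot be converted into a probability of remission from the information given here.

```python
import math


def remission_score(
    albumin: float,         # g/L
    mchc: float,            # units as reported in the article
    monocytes: float,       # 10^9/L
    lymphocytes: float,     # 10^9/L
    complex_disease: bool,  # Montreal B2/B3
    ggt: float,             # IU/L, must be > 0
) -> float:
    """Linear predictor from the final multivariable model (2 s.f.)."""
    return (
        0.079 * albumin
        + 0.050 * mchc
        - 1.1 * monocytes
        - 0.43 * lymphocytes
        - 1.0 * (1 if complex_disease else 0)
        - 0.69 * math.log(ggt)  # natural log, as in Table 2
    )
```

Holding everything else fixed, complex disease behavior lowers the score by 1.0 and each additional g/L of albumin raises it by 0.079, which matches the direction of the univariable associations reported above.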
Outcome prediction: After 10× 5-fold cross validation, the average AUC of the statistical learning algorithm was 0.659 [95% confidence interval (CI): 0.562-0.756]. This suggests that, in external cohorts with similar characteristics to the study cohort, the statistical learning algorithm is expected to rank a randomly chosen patient who achieves remission above a randomly chosen patient who does not 65.9% of the time (see Table 3). The algorithm performed better than chance (AUC > 0.5) 94% of the time and had an AUC > 0.7 on 38% of occasions (see Figure 3). The apparent naïve AUC (when trained and tested on the same data) of the model was 0.771.
Figure 3 Distribution of area under the receiver operator characteristic curve after 10 × 5 fold cross validation.
A: Conventional statistical learning algorithm (mean 0.659, SD 0.095); B: Recurrent neural network (mean 0.754, SD 0.078); C: Head-to-head comparison, matched at each fold and repetition (mean difference, + 0.095, P = 0.036). AUC: Area under the receiver operator characteristic curve.
Table 3 Comparison of learning algorithms during cross-validation experiments.
1Clinical data refers to non-biochemical data such as age, sex, disease characteristics and concurrent treatments. Biomarker data refers to complete blood count, liver function tests and albumin.
2P value for comparison against conventional statistical algorithm, using the variance-corrected repeated k-fold t test. AUC: Area under the receiver operator characteristic curve; ANN: Artificial neural network.
Deep learning algorithms
Feed-forward ANN with complete baseline data: The feed-forward ANN with complete baseline data had a cross-validated AUC of 0.710 (95%CI: 0.622-0.799) (see Figure 3 and Table 3). This difference was not statistically significant using the variance corrected t test (P = 0.25). The algorithm performed better than chance 100% of the time and had good performance (AUC > 0.7) 54% of the time (see Figure 3). For comparison, the naïve AUC of the model was 0.857.
Feed-forward ANN with baseline biomarker data only: The same feed-forward ANN using only baseline biomarker data had a similar cross-validated AUC of 0.706 (95%CI: 0.621-0.791), which was again not significantly different compared to the conventional algorithm (P = 0.33) (see Table 3). The algorithm performed better than chance 100% of the time and had good performance (AUC > 0.7) 58% of the time (see Figure 3). The naïve AUC of the model was 0.776.
RNN with repeated biomarker data: The RNN using repeated biomarker data had a cross-validated AUC of 0.754 (95%CI: 0.674-0.834), which was significantly higher than the AUC of the conventional algorithm (P = 0.036) (see Table 3). This suggests that, in external cohorts with similar characteristics to the study cohort, the RNN is expected to rank a randomly chosen patient who achieves remission above a randomly chosen patient who does not 75.4% of the time. The RNN algorithm performed better than chance 100% of the time and had good performance (AUC > 0.7) 72% of the time (see Figure 3). For comparison, the naïve AUC of the model was 0.892.
The rapid expansion of available health data has motivated the development of machine learning and deep learning tools to predict useful outcomes in clinical medicine[5,6]. The advent of machine learning and data science techniques is especially applicable to IBD due to the heterogeneity and chronic nature of such conditions and the repeated measures of disease activity over time which provides data that may be more suitable for complex modelling techniques. For instance, those with CD typically present with a wide array of disparate disease phenotypes and underlying pathogeneses, and their response to treatment and the trajectory of their disease course varies substantially and changes based on their response. This study has exhibited the potential of deep learning algorithms in predicting response to anti-TNF therapy in patients with CD. The ability to predict the likelihood of response to a given treatment is crucial for risk-benefit assessment, which in turn is crucial to facilitate shared decision making between clinicians and patients. Further, although biologic therapies have revolutionized management in IBD, medical therapy is now the principal driver of healthcare costs[33,34] and health economic considerations will inevitably affect treatment choice. Ideally, patients should receive therapies that are both likely to work and cost-effective. Therefore, there can be no ‘one-size-fits-all’ strategy to management, and precision and personalized medicine are key objectives.
Conventional statistical learning algorithms have generated many useful clinical scores, including the CD activity index, the simple endoscopic score for CD, scores to predict response to biologic therapies, and scores to differentiate CD from intestinal tuberculosis. The advantage of conventional scores is often their simplicity and interpretability. A simple score can be memorized and calculated at the bedside, and is intuitive because it utilizes important risk factors for the outcome of interest. Yet clinical scores can only apply to a rather generic subgroup of patients and are never specific to any individual, as they utilize relatively few variables. Further, conventional methods are not readily able to model more complex, non-linear or time-dependent health states. With new genomic and microbiomic profiling, as well as the rapid uptake of comprehensive electronic medical records with mass data linkage, the ability of conventional learning algorithms to select useful predictive factors may become redundant.
Although the advantages of deep learning for the analysis of non-numerical data types are obvious, such as image data in endoscopy[39-41] and text or speech data in natural language processing, the utility of deep learning for the analysis of numerical data is less clear but remains promising. A recent study has demonstrated the utility of machine learning in predicting anti-TNF response in rheumatoid arthritis, but relied on genetic markers in addition to clinical data. Another recent study used machine learning to predict whether patients with ankylosing spondylitis required anti-TNF therapy, but did not evaluate whether response to therapy could be predicted. It is anticipated that new data science and machine learning techniques are required to handle large amounts of data for use in clinical practice, although the optimal algorithms for this task remain unknown. Nevertheless, with the provision of comprehensive training data, machine learning tools have the potential to aid in individualized risk prediction, although no such model exists in IBD currently. In our cohort, the RNN deep learning algorithm was able to outperform the conventional algorithm after incorporating repeated biomarker measurements and thus additionally learn the non-linear temporal dynamics of the respective biomarkers, a feat that is not possible with conventional prediction models. It is expected that with enough training data, deep learning methods such as the RNN will be able to incorporate the time series data from multiple repeated health states of an individual patient over time. The clear trade-off with deep learning methods is the need for more data coordination and software to execute. However, the continued uptake of automated medical records in routine clinical practice may mitigate this limitation in future.
Further, with the ever increasing breadth and volume of information from sources including comprehensive previous medical history, serum and fecal biomarkers, imaging and endoscopic data as well as genetics, the role of machine learning in prediction in chronic diseases including IBD is likely to expand.
This study has also demonstrated the importance of applying model validation techniques during model development. ANNs and other powerful algorithms have the ability to learn intricate differences in data, yet poorly specified models that focus only on learning power have the propensity to learn the random variations or artefacts in the data, which are present only due to chance. This is evidenced by the RNN in this study achieving excellent AUC during training, but a reduced AUC when tested on unseen data (naïve AUC 0.892; cross-validated AUC 0.754). The same phenomenon occurred with the statistical learning algorithm but to a somewhat lesser extent (naïve AUC 0.771; cross-validated AUC 0.659). Therefore, studies developing predictive models should take care to avoid naïvely assessing predictive performance and ensure that effective cross-validation or bootstrapping methods are used for appropriate internal validation. If available, external validation of predictive models in entirely new and different cohorts is the gold standard for model validation.
The dataset used in this study was retrospective and from a single center, which subjects the results to information bias and limits their external validity. The outcome used was biochemical remission, as this is readily available as a repeated measure, which allowed demonstration of both conventional and machine learning models; however, it is acknowledged that clinical symptoms and/or mucosal healing are more clinically relevant end-points. Nevertheless, the goal of this study was to demonstrate the feasibility of deep learning methods in clinical prediction in this proof-of-concept study, rather than to develop a specific predictive model. Further, in practice, much larger cohorts will be required to properly train and calibrate deep learning models to maximize their utility in the real world. In future, all studies investigating specific predictive models should be subject to prospective controlled validation prior to their application in clinical practice, specifically by showing that outcomes are improved after using predictive models to guide management.
In conclusion, we have demonstrated the feasibility of deep learning algorithms for clinical prediction in CD, which demonstrated an improved predictive performance compared to conventional methods. However, conventional statistical methods retain the advantage of simplicity and intuitiveness, allowing their use at the bedside. Yet with the rapid expansion of available health data, machine learning models have the potential to supersede currently conventional methods and greatly improve the development of tools for the clinical prediction of patient outcomes.
Machine learning and artificial intelligence have the potential to revolutionize precision care in inflammatory bowel diseases. The greatest area of interest has been the application of deep learning methods in automatic tumor detection during endoscopy, yet the application of such techniques in clinical outcome prediction has been lacking.
Traditional approaches to clinical prediction rely on conventional statistical algorithms such as regression, which are not suitable for more complex data such as repeated biomarker measurements.
To determine and compare the utility of deep learning with conventional algorithms in predicting response to anti-tumor necrosis factor (anti-TNF) therapy in Crohn's disease (CD).
A retrospective cohort of CD patients commenced on anti-TNF therapy was used to experimentally develop and cross-validate three supervised learning algorithms: (1) Statistical learning algorithm; (2) Feed-forward artificial neural network; and (3) Recurrent neural network with repeated data. Predictive utility was quantified using the area under the receiver operating characteristic curve (AUC).
Within our cohort of 146 patients, the conventional statistical learning algorithm had the weakest performance [AUC 0.659, 95% confidence interval (CI): 0.562-0.756], compared to the feed-forward artificial neural network (AUC 0.710, 95%CI: 0.622-0.799; P = 0.25 vs conventional) and the recurrent neural network using repeated biomarker measurements (AUC 0.754, 95%CI: 0.674-0.834; P = 0.036 vs conventional).
Deep learning methods are feasible and have the potential for stronger predictive performance compared to conventional model building methods when applied to predicting remission after anti-TNF therapy in CD.
This is the first study to investigate the utility of deep neural networks in predicting clinical outcomes using repeated clinical data in inflammatory bowel disease. Future studies should incorporate additional data types such as genetic, imaging and endoscopic factors.
Maaser C, Sturm A, Vavricka SR, Kucharzik T, Fiorino G, Annese V, Calabrese E, Baumgart DC, Bettenworth D, Borralho Nunes P, Burisch J, Castiglione F, Eliakim R, Ellul P, González-Lama Y, Gordon H, Halligan S, Katsanos K, Kopylov U, Kotze PG, Krustinš E, Laghi A, Limdi JK, Rieder F, Rimola J, Taylor SA, Tolan D, van Rheenen P, Verstockt B, Stoker J; European Crohn’s and Colitis Organisation [ECCO] and the European Society of Gastrointestinal and Abdominal Radiology [ESGAR]. ECCO-ESGAR Guideline for Diagnostic Assessment in IBD Part 1: Initial diagnosis, monitoring of known IBD, detection of complications. J Crohns Colitis. 2019;13:144-164.
Bouckaert RR, Frank E. Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. In: Dai H, Srikant R, Zhang C. Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. Springer: Berlin, Heidelberg, 2004.
van der Valk ME, Mangen MJ, Severs M, van der Have M, Dijkstra G, van Bodegraven AA, Fidder HH, de Jong DJ, van der Woude CJ, Romberg-Camps MJ, Clemens CH, Jansen JM, van de Meeberg PC, Mahmmod N, van der Meulen-de Jong AE, Ponsioen CY, Bolwerk C, Vermeijden JR, Siersema PD, Leenders M, Oldenburg B; COIN study group and the Dutch Initiative on Crohn and Colitis. Evolution of Costs of Inflammatory Bowel Disease over Two Years of Follow-Up. PLoS One. 2016;11:e0142481.
Best WR, Becktel JM, Singleton JW, Kern F Jr. Development of a Crohn's disease activity index. National Cooperative Crohn's Disease Study. Gastroenterology. 1976;70:439-444.