Wang HN, An JH, Wang FQ, Hu WQ, Zong L. Predicting gastric cancer survival using machine learning: A systematic review. World J Gastrointest Oncol 2025; 17(5): 103804 [DOI: 10.4251/wjgo.v17.i5.103804]
Hong-Niu Wang, Fu-Qiang Wang, Wen-Qing Hu, Liang Zong, Department of Gastrointestinal Surgery, Changzhi People’s Hospital, The Affiliated Hospital of Changzhi Medical College, Changzhi 046000, Shanxi Province, China
Hong-Niu Wang, Graduate School of Medicine, Changzhi Medical College, Changzhi 046000, Shanxi Province, China
Jia-Hao An, Department of Graduate School of Medicine, Changzhi Medical College, Changzhi 046000, Shanxi Province, China
Author contributions: Wang HN and An JH contributed equally to the preparation of the manuscript; Wang HN designed the review, collected and analyzed the data, and wrote the manuscript; An JH also designed the review, collected and analyzed the data, provided detailed explanations for the figures, and drafted the manuscript; Wang FQ, Hu WQ and Zong L reviewed and revised the manuscript. All authors have read and approved the final version of the manuscript.
Conflict-of-interest statement: The authors report no relevant conflicts of interest for this article.
PRISMA 2009 Checklist statement: All authors have read the PRISMA 2009 checklist, and the manuscript has been prepared and revised according to the PRISMA 2009 checklist.
Open Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
Corresponding author: Liang Zong, MD, PhD, Department of Gastrointestinal Surgery, Changzhi People’s Hospital, The Affiliated Hospital of Changzhi Medical College, No. 502 Changxing Middle Road, Changzhi 046000, Shanxi Province, China. 250537471@qq.com
Received: December 4, 2024 Revised: February 20, 2025 Accepted: February 26, 2025 Published online: May 15, 2025 Processing time: 162 Days and 15.1 Hours
Abstract
BACKGROUND
Gastric cancer (GC) has a poor prognosis, and the accurate prediction of patient survival remains a significant challenge in oncology. Machine learning (ML) has emerged as a promising tool for survival prediction, though concerns regarding model interpretability, reliance on retrospective data, and variability in performance persist.
AIM
To evaluate ML applications in predicting GC survival and to highlight key limitations in current methods.
METHODS
A comprehensive search of PubMed and Web of Science in November 2024 identified 16 relevant studies published between 2019 and 2024. The most frequently used ML models were deep learning (37.5%), random forests (37.5%), support vector machines (31.25%), and ensemble methods (18.75%). The dataset sizes varied from 134 to 14177 patients, with nine studies incorporating external validation.
RESULTS
The reported area under the curve values were 0.669–0.980 for overall survival, 0.920–0.960 for cancer-specific survival, and 0.710–0.856 for disease-free survival. These results highlight the potential of ML-based models to improve clinical practice by enabling personalized treatment planning and risk stratification.
CONCLUSION
Despite challenges related to the reliance on retrospective data and limited model interpretability, ML models show promise; prospective trials and multidimensional data integration are recommended to improve their clinical applicability.
Core Tip: Machine learning offers significant promise for predicting gastric cancer patients' survival, but challenges such as data quality, model interpretability, and generalizability must be addressed. This review highlights the importance of integrating diverse data types, robust data preprocessing, and advanced feature-selection techniques to improve prediction accuracy. While open-access and private datasets each have their advantages, ensuring the timeliness and relevance of data is essential for the development of clinically applicable models.
Citation: Wang HN, An JH, Wang FQ, Hu WQ, Zong L. Predicting gastric cancer survival using machine learning: A systematic review. World J Gastrointest Oncol 2025; 17(5): 103804
INTRODUCTION
Gastric cancer (GC) is the fifth most common cancer worldwide and the fifth leading cause of cancer-related deaths[1]. Due to its aggressive nature and significant heterogeneity, GC presents a major global health challenge[2]. Its high incidence and poor prognosis complicate treatment strategies[3,4]. Despite advances in surgical techniques and perioperative chemotherapy, the 5-year survival rate for patients with GC remains alarmingly low at 5%–20%[5]. Cancer survival is defined as the period from diagnosis to death due to the cancer[6]. Survival analyses are crucial for clinicians, researchers, patients, and policymakers, highlighting the urgent need for reliable predictive systems to assess prognoses and guide treatment decisions.
Artificial intelligence (AI) has made significant strides in the healthcare sector, and its roles are rapidly expanding. As AI technologies continue to evolve, the digital transformation of healthcare is becoming increasingly evident[7]. Machine learning (ML), a subset of AI, is particularly promising, with the potential to play a pivotal role in supporting both healthcare providers and patients[8].
Traditionally, survival prediction models have been based on Cox regression, which estimates the relative risk of an event occurring through a linear combination of covariates. However, the accuracy of these models depends on the careful selection of relevant variables by researchers. In parallel, ML techniques have shown significant potential in enhancing cancer prognosis prediction. ML includes various analytical methods such as random forests (RFs), ensemble methods, naïve Bayes classifiers, support vector machines (SVMs), neural networks (NNs), decision trees (DTs), and eXtreme Gradient Boosting (XGBoost), among others[9]. Several studies have demonstrated that ML approaches can outperform traditional methods in predictive accuracy[10].
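For context, the Cox model referred to above expresses the hazard as a baseline hazard scaled by a log-linear function of the covariates; the generic formulation below is included only to make that linearity assumption explicit and is not taken from any included study.

```latex
h(t \mid \mathbf{x}) = h_0(t)\,\exp\big(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\big)
```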
As electronic medical record datasets continue to expand, their volume and complexity now exceed the capacity for human analysis. This has led to issues such as diagnostic errors, workflow inefficiencies, and inappropriate treatments in healthcare systems[11]. AI is increasingly being leveraged to address these challenges, given its ability to quickly process vast amounts of data and extract valuable insights. ML offers the advantage of reducing clinicians' workloads and minimizing human error. The high performance of ML makes it a promising tool for healthcare providers, as it will facilitate the development of predictive models to forecast cancer survival rates. The clinical prospects of ML in healthcare, particularly in cancer management, are vast and continue to evolve.
By integrating and analyzing multiple types of data (e.g., clinical, genomic, imaging, and histopathological information), ML models hold the potential to predict patient survival outcomes and assist in personalizing treatment strategies. By providing precise and timely predictions, ML models could help oncologists identify high-risk patients early, enabling the implementation of tailored interventions and targeted therapies. Moreover, the ability of ML models to continuously learn and adapt from new data ensures that these models can remain relevant and accurate, improving over time as more information becomes available.
The performance of ML algorithms is heavily dependent on the quality of the data used. It is known that poor data quality can lead to suboptimal model performance and inaccurate predictions[12]. However, there has been limited research exploring the impact of data quality and preprocessing steps on ML models. In addition, the data must be meticulously processed and made interpretable for machines. This requires a precise labeling of the data used to train the models, as any inaccuracies can significantly affect the model's prognostic predictions. Model interpretability remains a critical challenge, especially in deep learning models[13-15]. Solutions to these interpretability issues should be considered alongside the vast potential of AI in research.
Despite the growing potential of AI, the application of ML in GC remains underexplored. To promote the integration of ML into routine clinical practice, it is crucial to bridge this gap and gain a deeper understanding of ML's role and progress in GC research. This systematic review aims to provide a comprehensive overview of the application of ML in predicting survival outcomes in GC prognosis models.
MATERIALS AND METHODS
Search strategy
The search strategy applied in this review was designed around the key phrase "predicting GC survival using machine learning". Three thematic groups comprising five key terms were established: "gastric cancer"; "survival"; and "machine learning", "deep learning", or "artificial intelligence", together with their relevant synonyms. Two reviewers (Wang HN and An JH) independently searched two electronic databases, PubMed and Web of Science, covering studies published up until November 22, 2024. The search keywords were "gastric cancer", "gastric tumor", "survival", "machine learning", "deep learning", and "artificial intelligence". The full search strategy used in this systematic review is detailed in Supplementary Table 1. The study adhered to the PRISMA guidelines[16].
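Purely as an illustration (the database-specific syntax actually used is provided in Supplementary Table 1), the listed keywords combine into a Boolean query along the following lines; any further synonyms or field tags are omitted here.

```python
# Illustrative Boolean combination of the listed keywords only; the field tags and
# synonym expansions from Supplementary Table 1 are intentionally omitted.
query = (
    '("gastric cancer" OR "gastric tumor") '
    'AND ("survival") '
    'AND ("machine learning" OR "deep learning" OR "artificial intelligence")'
)
print(query)
```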
Inclusion and exclusion criteria
The following inclusion criteria were applied: (1) Peer-reviewed articles; (2) Studies that used ML algorithms to model GC survival; and (3) Published in English. The exclusion criteria were: (1) Studies that did not use ML models; (2) Articles not written in English; and (3) Reviews, case reports, conference abstracts, editorials, letters, or articles with abstract-only content or no full text. Two independent reviewers (Wang HN and An JH) screened all titles and abstracts. Any discrepancies were resolved through discussion with a third experienced reviewer.
Data extraction
The data extraction was performed independently by two reviewers (Wang HN and An JH), with discrepancies resolved by consultation with a third reviewer (Wang FQ). The literature screening was conducted using MS Office Excel 2021. Initially, titles and abstracts were reviewed to exclude irrelevant studies, followed by a full-text assessment to determine eligibility. The CHARMS (Checklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies) checklist was also used[17]. The data extracted included the following categories (Supplementary Table 2).
Bibliographic information: First author, year of publication, country of study, type of prediction, and prediction outcomes.
Data characteristics: Data source, data type, sample size.
Data preparation and modeling process: Details on missing data handling, preprocessing algorithms and techniques, feature selection methods, types of predictive variables, and ML algorithms used.
Model construction and performance evaluation: Internal and external validation methods, evaluation metrics, calibration techniques, and hyperparameter tuning.
Predictive variables: Number and ranking of predictive variables, and their proportions.
Assessment of the risk of bias
The risk of bias in the included prediction models was assessed by two reviewers using the Prediction Model Risk of Bias Assessment Tool (PROBAST)[18]. The PROBAST is designed to evaluate both the risk of bias and applicability in prognostic and diagnostic models. It covers four domains: Participants, predictors, outcomes, and analysis. Each domain includes specific criteria for assessing low, high, and unclear risks of bias.
RESULTS
The comprehensive search of the PubMed and Web of Science databases identified 832 articles. After the exclusion of 269 duplicates, 563 unique records remained. Screening of the titles and abstracts excluded 521 studies that did not meet the inclusion criteria, leaving 42 studies for full-text eligibility assessment. Ultimately, 16 studies were included in the final analysis (Figure 1).
Figure 1 PRISMA flowchart of the study selection process.
Characteristics of the included studies
As summarized in Table 1, the included studies were published between 2019 and 2024; a detailed overview is provided in Table 2. Of the 16 studies, 14 examined overall survival (OS)[19-32], one examined cancer-specific survival (CSS)[33], and four examined disease-free survival (DFS)[20,21,26,34] in patients with GC, with some studies addressing more than one outcome.
Table 1 Extracted characteristics of the included articles.
Table 2 Classification of the features of the included articles.
Characteristics | Categories | OS (n) | CSS (n) | DFS (n)
Dataset sources | Hospitals | 6 | - | 3
Dataset sources | SEER | 3 | 1 | -
Dataset sources | TCGA | 4 | - | 1
Dataset sources | NOGCA | 1 | - | -
Dataset sources | TANRIC | 1 | - | 1
Dataset privacy | Public | 8 | 1 | 1
Dataset privacy | Private | 6 | - | 3
Data source | Single | 6 | 1 | 1
Data source | Multiple | 8 | - | 3
Preprocessing | Yes | 14 | 1 | 3
Preprocessing | No | - | - | 1
Feature selection | Yes | 13 | 1 | 2
Feature selection | No | 1 | - | 2
Models | One | 5 | - | 4
Models | Two or more | 9 | 1 | -
Model type | GB | 1 | - | -
Model type | HGB | 1 | - | -
Model type | KNN | 1 | - | -
Model type | LR | 2 | - | -
Model type | NB | 1 | - | -
Model type | RF | 6 | - | -
Model type | SVM | 5 | - | 2
Model type | XGBoost | 2 | - | -
Model type | DL | 6 | - | 2
Model type | MultiDeepCox-SC | 1 | - | -
Model type | Ensemble learning | 2 | 1 | -
Validation | Internal | 14 | 1 | 3
Validation | External | 8 | - | 2
Evaluation | C-index | 10 | - | 3
Evaluation | AUC | 13 | 1 | 4
Evaluation | Calibration | 6 | 1 | 1
Evaluation | Brier score | 4 | - | 1
Evaluation | Accuracy | 3 | - | -
Evaluation | Specificity | 2 | - | -
Evaluation | Sensitivity | 2 | - | -
Evaluation | F1-score | 2 | - | -
Evaluation | IBS | 1 | - | -
Data types | Clinical | 7 | 1 | 1
Data types | Image | 1 | - | -
Data types | Clinical + Image | 1 | - | 2
Data types | Clinical + Molecular | 4 | - | 1
Data types | Clinical + Molecular + Image | 1 | - | -
Assessment of the risk of bias
Bias risk and applicability assessments were conducted for the 16 studies included in this review. As shown in Table 3, three of the studies were classified as having a high risk of bias[21,22,34], five studies had a moderate risk of bias[20,25,30,31,33], and the remaining eight studies were considered to have a low risk of bias[19,23,24,26-29,32].
Table 3 Risk of bias and applicability assessment of included articles based on the Prediction Model Risk of Bias Assessment Tool criteria.
Seven studies used datasets from hospitals and clinics[21,26,29-32,34], four studies accessed the Surveillance, Epidemiology, and End Results (SEER) database[24,25,27,33], and another four relied on The Cancer Genome Atlas (TCGA) database[20,22,23,28]. The National Oesophago-Gastric Cancer Audit (NOGCA) database[19] and the TANRIC database (an open-access resource for interactive exploration of lncRNAs in cancer)[20] were each used in one study. Nine of the datasets were publicly available[19,20,22-25,27,28,33], while seven were private[21,26,29-32,34]. The dataset sizes ranged from 134 to 14177 records, with seven studies reporting > 1000 records[19,21,24,25,27,30,32].
Data preprocessing
Data preprocessing techniques were applied in 15 of the studies, addressing missing data in 11 of them[19-21,23-27,31-33]. The strategies for handling missing values included listwise deletion, multiple imputation, and the K-nearest neighbor (KNN) algorithm. Feature selection was performed in all but two studies[21,34], with 10 of the studies providing detailed descriptions of their methods[19,20,22-24,27,28,31,32]. Common feature selection algorithms included the Boruta method[19,31], relief forward selection[20,31], minimal-redundancy-maximal-relevance (mRMR)[31], Cox regression[22,23,27,28], RFs[24,33], survival analysis[32], and LASSO regression[23,31]. One study also applied data normalization to ensure uniform scaling across features[30].
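As a minimal sketch (not reproducing any particular study's pipeline), KNN imputation and an RF-based feature ranking of the kind listed above can be chained in scikit-learn; the file name and label column are hypothetical.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical clinical table: rows are patients, columns are numeric predictors plus a survival label.
df = pd.read_csv("gc_cohort.csv")           # assumed file name
X = df.drop(columns=["survival_5yr"])       # assumed binary label column
y = df["survival_5yr"]

# KNN imputation: each missing value is replaced using the k most similar patients.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# Random-forest feature ranking, one of the selection strategies reported in the included studies.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_imputed, y)
ranking = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))
```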
Data modeling
Five of the 16 studies used a single ML algorithm to construct their models[19-21,26,34]. The other 11 studies employed multiple algorithms, comparing their results to identify the best-performing model[22-25,27-33]. The most frequently used algorithms were RFs, SVMs, deep learning, and ensemble learning. Hyperparameter tuning was incorporated into model training in 13 studies[20-25,27-29,31-34].
Model validation
Seven studies applied internal validation only, and the other nine used both internal and external validation[4,21-26,28,32]. Cross-validation was the most common internal validation method. Performance evaluation metrics varied across the studies. The reported area under the curve (AUC) values ranged from 0.669 to 0.980 across 15 studies. Five studies reported C-indices between 0.63 and 0.84. Two studies provided additional metrics, including accuracy (0.8910–0.9200), specificity (0.8715–0.9000), sensitivity (0.8942–0.9400), and the F1-score (0.9080–0.9200). Detailed information on these metrics is presented in Table 4.
Table 4 Reported evaluation metrics by survival type, from the lowest to the highest value (%).
Evaluation method | OS min (%) | OS max (%) | CSS min (%) | CSS max (%) | DFS min (%) | DFS max (%)
AUC | 66.90 | 98.00 | 92.00 | 96.00 | 71.00 | 85.60
C-index | 63.00 | 84.00 | - | - | 65.40 | 71.00
Brier score | 13.70 | 25.00 | - | - | - | -
Accuracy | 89.10 | 92.00 | - | - | - | -
Specificity | 87.15 | 90.00 | - | - | - | -
Sensitivity | 89.42 | 94.00 | - | - | - | -
F1-score | 90.80 | 92.00 | - | - | - | -
IBS | 14.20 | 15.10 | - | - | - | -
Among the studies comparing multiple models, ensemble and hybrid approaches demonstrated superior performance in three studies[22,28,33]. Deep learning models exhibited optimal results in three other studies[24,25,27]. RF algorithms were observed to be the most effective in two studies[29,32], while XGBoost[31] and gradient boosting (GB)[30] models were identified as the best-performing algorithms in one study each.
Key variables
Clinical tabular data were used as input in 10 studies[19,21,24-27,30,31,33,34], with eight studies relying solely on these data[19,24-27,30,31,33]. Two studies incorporated both clinical and image-based data[21,34], and one study exclusively used image data[29]. Six studies employed molecular data for survival prediction[20,22,23,28,31,32]. A ranking of frequently used predictors is presented in Table 5. Common predictors included age, sex, ethnicity, cancer type, stage, grade, lymph node count, metastasis, histopathological features, tumor size, primary tumor site, metastasis status, American Society of Anesthesiologists classification, treatment modality, history of other cancer(s), and marital status. Six studies evaluated the contribution of specific predictors to survival outcomes[19,24,27,30,32,33].
Table 5 Predictive variables for survival types extracted from the articles.
Selected features | Number (n) | Percentage (%)
Age | 7 | 87.5
Stage | 7 | 87.5
Grade | 6 | 75.0
Treatment modality | 6 | 75.0
Primary tumor site | 5 | 62.5
Sex | 4 | 50.0
Tumor size | 4 | 50.0
Race | 3 | 37.5
Histopathology type | 3 | 37.5
Marital status | 3 | 37.5
Positive lymph node numbers | 2 | 25.0
Lymph node metastasis | 2 | 25.0
Metastasis status | 2 | 25.0
Regional nodes examined | 1 | 12.5
Lymph node dissection | 1 | 12.5
ASA grade | 1 | 12.5
History of other cancers | 1 | 12.5
Blood markers | 1 | 12.5
Lauren type | 1 | 12.5
Lymphovascular invasion | 1 | 12.5
Months from diagnosis to treatment | 1 | 12.5
Body weight | 1 | 12.5
DISCUSSION
This review synthesizes the application of ML algorithms in predicting the survival of patients with GC, providing insights into the potential of ML for guiding clinical decision-making in this field. From an initial pool of 832 articles, 16 studies were included for analysis. These studies offer both qualitative and quantitative evidence concerning the use of ML models to predict the survival of GC patients. Although the number of ML-based survival prediction studies in GC is still limited, this review reveals that the available research encompasses multiple survival outcomes, including patient OS, CSS, and DFS. Because few studies addressed each individual survival outcome, all outcome types were analyzed together.
Our PROBAST assessment revealed that studies with a high or moderate risk of bias tend to overestimate the model performance due to methodological limitations, such as insufficient external validation or incomplete handling of missing data. Although these studies contributed to the variability in the observed AUC/C-index values, their findings should be interpreted with caution. To further enhance the clinical applicability of ML, future research must prioritize robust validation frameworks and transparent reporting, particularly when integrating novel data types.
Data collection and databases
Among the 16 studies, nine used open-access databases, offering transparency in data preprocessing and model construction. Open-access datasets allow for collaborative validation and the optimization of models across different research groups. However, the use of older public databases may present challenges, as clinical practices evolve over time. Models trained on outdated data may lose clinical relevance, posing a problem that has been previously highlighted[35]. While open-access databases are valuable for survival prediction studies, their potential limitations emphasize the need for real-time data management and model timeliness. In contrast, private datasets require ethics approval and informed consent, which restricts their broader validation and comparison, thus limiting their contribution to model refinement.
Despite the advantages of public datasets, the generalizability of models built on these datasets to specific clinical settings remains a concern. Many publicly available datasets were collected years ago and may not fully reflect current clinical practices, making it imperative to develop methods for ensuring the continued relevance of predictive models.
The studies reviewed herein used datasets of varying sizes, with the largest comprising 14177 clinical samples[27] and the smallest including 134 samples[20]. ML algorithms are well-suited for handling multidimensional data, with the assumption that larger sample sizes improve model accuracy[36,37]. Among the studies examined in this review, those using image datasets generally had fewer records than those based on tabular data. Training models with small datasets can lead to overfitting, reducing the findings' generalizability[38]. Nevertheless, image datasets often yield more accurate models than tabular data, due to the advanced capabilities of image-processing algorithms[39]. These capabilities include feature extraction, feature selection, transfer learning, fine-tuning, data augmentation, and object detection[39-41]. In addition, convolutional NNs (CNNs) have shown promising results for 3D image analysis[42]. The increasing use of medical image datasets for survival prediction suggests that combining larger image datasets with more advanced CNN architectures will produce more robust models.
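As a hedged illustration of the transfer-learning and fine-tuning workflow mentioned above (not the architecture used in any included study), a pretrained 2D CNN backbone can be adapted to a binary survival endpoint in PyTorch; the two-class head, frozen layers, and input size are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the head for a two-class survival endpoint
# (assumed label: e.g., 5-year survival yes/no).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

# Freeze early layers so that only the last block and the new head are fine-tuned.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

# Dummy forward pass to confirm shapes (batch of 4 RGB images at 224x224).
logits = model(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 2])
```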
In addition to image data, molecular markers such as gene expression and mutations have emerged as important factors in predicting the survival of patients with GC. Recent studies have integrated multiple data types to enhance prediction accuracy. As GC research advances, new variables related to survival have been identified[43,44], highlighting the disease's complexity and the need to consider these factors in predictive models.
Data preprocessing
Data quality is crucial for the performance of ML algorithms in predictive modeling[45]. Medical datasets often contain noise, redundancy, outliers, missing data, and irrelevant variables, each of which can degrade a model's performance[46]. Proper data preprocessing—including reduction, cleaning, transformation, and integration—is essential to improve the accuracy of ML models. Missing data should be handled appropriately, with techniques such as single imputation, regression, and KNN-based imputation being commonly employed[47]. However, simple imputation methods may introduce bias, particularly with high-dimensional or large datasets[48]. More advanced approaches, such as multiple imputation, offer greater reliability by providing unbiased estimates of missing values and their standard errors[49]. Deleting missing data is another option, though doing so can distort data distributions and introduce bias[48]. Although deletion is straightforward, it should be used cautiously to avoid compromising the model accuracy[50]. It is crucial to avoid altering the data distribution prior to model training to maintain the reliability of the predictions.
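A minimal sketch contrasting single (mean) imputation with an iterative, multiple-imputation-style approach in scikit-learn; the small matrix is invented purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing values (invented for illustration).
X = np.array([[65.0, 3.2, np.nan],
              [72.0, np.nan, 1.1],
              [58.0, 2.9, 0.9],
              [np.nan, 4.1, 1.4]])

# Single imputation: replaces each missing value with the column mean (simple, but can bias estimates).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Iterative imputation: models each incomplete feature as a function of the others,
# in the spirit of multiple imputation by chained equations.
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
print(X_mean.round(2))
print(X_iter.round(2))
```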
Normalization and standardization are important preprocessing steps that reduce redundancy and enhance data consistency[51,52]. These techniques, when combined with outlier removal, improve model performance. Effective data wrangling is critical to ensuring that the model produces reliable outputs. However, many studies have failed to report the preprocessing steps taken, which can significantly affect their model's performance. To enhance the generalizability of ML models in clinical settings, future research should prioritize comprehensive data cleaning, the robust handling of missing values, and a clear reporting of preprocessing methods.
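The two scaling approaches mentioned above can be applied with scikit-learn as follows; the values are invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[55.0, 120.0], [70.0, 340.0], [63.0, 80.0]])  # invented values (e.g., age, tumor marker)

X_std = StandardScaler().fit_transform(X)   # standardization: zero mean, unit variance per column
X_norm = MinMaxScaler().fit_transform(X)    # normalization: rescales each column to [0, 1]
print(X_std.round(2))
print(X_norm.round(2))
```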
Feature selection
Feature selection is another crucial aspect of ML model development. Among the studies reviewed herein, two (12.5%) did not use feature selection techniques[21,34]. Feature selection improves predictive accuracy by identifying the most relevant features related to GC survival. Hyperparameter tuning and feature selection are commonly used to prevent overfitting and improve model precision. One study using various feature selection algorithms demonstrated that models trained on a full feature dataset performed poorly compared to those trained on selected features[30]. Incorporating effective feature selection and extraction methods is therefore essential for achieving optimal model performance. Future studies should explore and compare features that are significantly relevant to the prediction of survival among patients with GC. Identifying universally applicable features could enhance the accuracy of predictions across various types of GC.
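A minimal sketch, assuming a binary survival label and numeric features, of how a LASSO-type selector and hyperparameter tuning can be nested inside cross-validation so that neither step leaks information from the held-out folds; the synthetic data stand in for a real cohort.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a clinical feature matrix and binary survival label.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)

# L1-penalized selector (LASSO-like) followed by a random forest classifier.
pipe = Pipeline([
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Hyperparameters for both the selector and the classifier are tuned within cross-validation.
grid = GridSearchCV(
    pipe,
    param_grid={"select__estimator__C": [0.01, 0.1, 1.0],
                "clf__n_estimators": [200, 500]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```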
ML algorithms
RF and deep learning were the most commonly used ML methods in the studies reviewed herein (37.5% each). Deep learning algorithms have shown significant promise, particularly in processing large, diverse datasets and detecting patterns in images, videos, text, and other forms of data[53]. In six of the 16 studies analyzed in the present review, deep learning was identified as the best-performing algorithm[21,23-25,27,34], and RF was recognized as the top algorithm in four studies[19,22,29,33]. RF has demonstrated excellent performance on small datasets and is capable of handling data with complex and large feature spaces[54]. Tree-based models such as DT, RF, and XGBoost are effective at detecting non-monotonic or nonlinear relationships between dependent and independent variables, but they also have limitations: when trained on small datasets with highly correlated predictor variables, the detection of interactions between predictors may be hindered, potentially leading to overfitting. A primary advantage of RF over individual DT algorithms is its ability to reduce overfitting, and RF has gained widespread recognition in the ML field owing to its performance relative to other tree-based methods such as DT, gradient boosting machines (GBMs), and XGBoost. In a model comprising 16 lncRNA features, SVM performed best[27]. Although SVM performs well on balanced datasets, its performance tends to decline on imbalanced datasets, and the studies in which SVM underperformed did not address dataset imbalance during the algorithm's application. Hybrid and ensemble models that combine the strengths of multiple learning algorithms have gained traction in the medical field[55]; these models offer feature selection capabilities and can improve computational efficiency, performance, and generalization. One research group reported that their model performed well in predicting longer survival times (10 years)[20]. However, no study has included ML models capable of predicting short-term, mid-term, and long-term survival in GC. Future research should focus on developing models that can accurately predict both short-term and long-term survival outcomes.
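A hedged sketch of the kind of head-to-head comparison performed in the multi-algorithm studies, scoring each model by cross-validated AUC and addressing class imbalance for the SVM through class weighting; the synthetic, imbalanced data are placeholders for a real cohort.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic cohort (roughly 1 event for every 4 non-events).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    # class_weight="balanced" mitigates the imbalance sensitivity discussed above.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced")),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```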
Model interpretation and validation
Interpretability is a critical issue for ML models[13]. Unlike traditional statistical models, ML algorithms are often considered "black boxes," which makes it difficult for researchers to understand the underlying prediction process and identify key variables affecting outcomes. Model performance is influenced not only by data quality, preprocessing, feature selection, and the suitability of the algorithm, but also by the validation strategy employed. Model validation is essential for assessing the performance of ML algorithms. External validation is considered the gold standard for evaluating the generalizability of models[56,57]. Of the studies reviewed herein, nine had external validation, which enhances the reliability of models by testing them on independent datasets. Many studies have used internal validation methods such as cross-validation or random splitting. Although cross-validation is useful for limited datasets[58], external validation ensures that the models used are applicable to diverse populations, each with unique characteristics that may influence survival outcomes. With the increasing availability of open-access datasets, external validation methods have become more feasible, enhancing the robustness of model evaluation.
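A minimal sketch of the distinction drawn above: internal validation by cross-validation on the development cohort, followed by a single evaluation on a held-out cohort standing in for an external dataset; both cohorts here are synthetic placeholders (in practice the external cohort would come from a different center or registry).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic cohort; the held-out portion stands in for an external cohort.
X, y = make_classification(n_samples=1200, n_features=15, n_informative=6, random_state=0)
X_dev, X_ext, y_dev, y_ext = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0)

# Internal validation: cross-validated AUC on the development cohort only.
internal_auc = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc").mean()

# "External" validation: fit once on the development cohort, evaluate on the untouched cohort.
model.fit(X_dev, y_dev)
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"internal AUC = {internal_auc:.3f}, external AUC = {external_auc:.3f}")
```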
This systematic review has several limitations. The variability in the studies included limits direct comparisons between them. Moreover, many studies did not provide detailed performance metrics such as specificity, sensitivity, or predictive values. Future research should aim to include comprehensive reports of preprocessing, feature selection, hyperparameter tuning, and model validation procedures, along with performance metrics. This will improve the transparency and reproducibility of ML-based survival prediction models for GC.
Another limitation of this review is the reliance on only two databases, PubMed and Web of Science, for the literature search. While both are well-established and widely used for their robust coverage of peer-reviewed biomedical literature, we recognize that incorporating additional databases, such as Scopus and Embase, could further enrich the comprehensiveness of our review. Those databases may include relevant studies not captured by our search. Nevertheless, we believe that PubMed and Web of Science offer a solid foundation for this review, given their extensive inclusion of high-quality, peer-reviewed articles. Future reviews may benefit from expanding the database selection to enhance the breadth of included studies.
CONCLUSION
Predicting the survival rates for patients with GC is critical for guiding treatment decisions. ML models have shown promising potential for the prediction of survival outcomes and are becoming increasingly applied in this field. The findings of this review highlight the growing use of ML algorithms based on clinical data, particularly from the SEER and TCGA databases. However, challenges remain, particularly related to data preprocessing, feature selection, and model validation. Addressing these challenges will improve the reliability and applicability of ML models for predicting the survival of patients with GC. Future research should focus on refining feature selection, optimizing the model choice, and enhancing model validation to better predict patient survival outcomes.
Footnotes
Provenance and peer review: Invited article; Externally peer reviewed.
Peer-review model: Single blind
Specialty type: Gastroenterology and hepatology
Country of origin: China
Peer-review report’s classification
Scientific Quality: Grade B, Grade B, Grade C
Novelty: Grade B, Grade B, Grade B
Creativity or Innovation: Grade B, Grade B, Grade C
Scientific Significance: Grade B, Grade B, Grade C
P-Reviewer: Qi XS; Yang MQ; Yang S S-Editor: Liu H L-Editor: Filipodia P-Editor: Zhang XD
Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, Pedreschi D. A Survey of Methods for Explaining Black Box Models. ACM Comput Surv. 2019;51:1-42.
Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration. BMJ. 2009;339:b2700.
Tian M, Yao Z, Zhou Y, Gan Q, Wang L, Lu H, Wang S, Zhou P, Dai Z, Zhang S, Sun Y, Tang Z, Yu J, Wang X. DeepRisk network: an AI-based tool for digital pathology signature and treatment responsiveness of gastric cancer using whole-slide images. J Transl Med. 2024;22:182.
Chen H, Zheng Z, Yang C, Tan T, Jiang Y, Xue W. Machine learning based intratumor heterogeneity signature for predicting prognosis and immunotherapy benefit in stomach adenocarcinoma. Sci Rep. 2024;14:23328.
Wu M, Yang X, Liu Y, Han F, Li X, Wang J, Guo D, Tang X, Lin L, Liu C. Development and validation of a deep learning model for predicting postoperative survival of patients with gastric cancer. BMC Public Health. 2024;24:723.
Li X, Zhai Z, Ding W, Chen L, Zhao Y, Xiong W, Zhang Y, Lin D, Chen Z, Wang W, Gao Y, Cai S, Yu J, Zhang X, Liu H, Li G, Chen T. An artificial intelligence model to predict survival and chemotherapy benefits for gastric cancer patients after gastrectomy development and validation in international multicenter cohorts. Int J Surg. 2022;105:106889.
Aznar-Gimeno R, García-González MA, Muñoz-Sierra R, Carrera-Lasfuentes P, Rodrigálvarez-Chamarro MV, González-Muñoz C, Meléndez-Estrada E, Lanas Á, Del Hoyo-Alonso R. GastricAITool: A Clinical Decision Support Tool for the Diagnosis and Prognosis of Gastric Cancer. Biomedicines. 2024;12.
Jiang Y, Zhang Z, Yuan Q, Wang W, Wang H, Li T, Huang W, Xie J, Chen C, Sun Z, Yu J, Xu Y, Poultsides GA, Xing L, Zhou Z, Li G, Li R. Predicting peritoneal recurrence and disease-free survival from CT images in gastric cancer with multitask deep learning: a retrospective study. Lancet Digit Health. 2022;4:e340-e350.
Al-jarrah OY, Yoo PD, Muhaidat S, Karagiannidis GK, Taha K. Efficient Machine Learning for Big Data: A Review. Big Data Res. 2015;2:87-93.
Maharana K, Mondal S, Nemade B. A review: Data pre-processing and data augmentation techniques. Global Transit Proc. 2022;3:91-99.
Siraj MM, Rahmat NA, Din MM. A Survey on Privacy Preserving Data Mining Approaches and Techniques. Proceedings of the 2019 8th International Conference on Software and Computer Applications. 2019.
Khadse V, Mahalle PN, Biraris SV. An Empirical Comparison of Supervised Machine Learning Algorithms for Internet of Things Data. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA). 2018.
Laupacis A. Clinical Prediction Rules. A Review and Suggested Modifications of Methodological Standards. JAMA. 1997;277:488-494.