1
Pushpanathan K, Zou M, Srinivasan S, Wong WM, Mangunkusumo EA, Thomas GN, Lai Y, Sun CH, Lam JSH, Tan MCJ, Lin HAH, Ma W, Koh VTC, Chen DZ, Tham YC. Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries? OPHTHALMOLOGY SCIENCE 2025; 5:100745. [PMID: 40291392 PMCID: PMC12022690 DOI: 10.1016/j.xops.2025.100745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/30/2024] [Revised: 02/01/2025] [Accepted: 02/14/2025] [Indexed: 04/30/2025]
Abstract
Objective The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability. Design Cross-sectional study. Subjects Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 from prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions). Methods For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric). Main Outcome Measures Mean summed scores of each model for correctness, completeness, and readability, rated on a 5-point scale (maximum score: 15). Results O1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopics, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15. Conclusions While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from its predecessor, ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
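The scoring scheme in this study (three masked attending-level graders per subtopic, each rating a response on a 5-point scale, summed to a maximum of 15 per metric) can be illustrated with a small sketch. All ratings below are invented, and the paired t-test is only an assumed choice of comparison, not the authors' actual analysis.

```python
# Sketch of the summed-score grading scheme: three masked graders each rate a
# response 1-5; the per-question score is the sum (max 15). Ratings are made up.
from statistics import mean
from scipy.stats import ttest_rel  # assumed paired comparison, not the authors' code

# ratings[model][question] = [grader1, grader2, grader3] correctness ratings (1-5)
ratings = {
    "o1":        [[5, 4, 4], [4, 4, 5], [5, 5, 4], [4, 5, 5]],
    "ChatGPT-4": [[3, 4, 3], [4, 3, 3], [3, 3, 4], [4, 4, 3]],
}

summed = {m: [sum(q) for q in qs] for m, qs in ratings.items()}  # per-question sums out of 15
print({m: round(mean(s), 1) for m, s in summed.items()})         # mean summed score per model

stat, p = ttest_rel(summed["o1"], summed["ChatGPT-4"])           # same questions, so paired
print(f"paired t-test p = {p:.3f}")
```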
Affiliation(s)
- Krithi Pushpanathan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
| | - Minjie Zou
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
| | - Sahana Srinivasan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
| | - Wendy Meihua Wong
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Erlangga Ariadarma Mangunkusumo
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - George Naveen Thomas
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Yien Lai
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Chen-Hsin Sun
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Janice Sing Harn Lam
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Marcus Chun Jin Tan
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Hazel Anne Hui'En Lin
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Weizhi Ma
- Institute for AI Industry Research, Tsinghua University, Beijing, China
| | - Victor Teck Chang Koh
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - David Ziyou Chen
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Department of Ophthalmology, National University Hospital, Singapore
| | - Yih-Chung Tham
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
- Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore
2
Delsoz M, Hassan A, Nabavi A, Rahdar A, Fowler B, Kerr NC, Ditta LC, Hoehn ME, DeAngelis MM, Grzybowski A, Tham YC, Yousefi S. Large Language Models: Pioneering New Educational Frontiers in Childhood Myopia. Ophthalmol Ther 2025; 14:1281-1295. [PMID: 40257570 PMCID: PMC12069199 DOI: 10.1007/s40123-025-01142-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2025] [Accepted: 04/02/2025] [Indexed: 04/22/2025] Open
Abstract
INTRODUCTION This study aimed to evaluate the performance of three large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4o (o1 Preview), and Google Gemini, in producing patient education materials (PEMs) and improving the readability of online PEMs on childhood myopia. METHODS LLM-generated responses were assessed using three prompts. Prompt A requested: "Write educational material on childhood myopia." Prompt B added a modifier specifying "a sixth-grade reading level using the FKGL (Flesch-Kincaid Grade Level) readability formula." Prompt C aimed to rewrite existing PEMs to a sixth-grade level using the FKGL. Responses were assessed for quality (DISCERN tool), readability (FKGL and SMOG (Simple Measure of Gobbledygook)), understandability and actionability (Patient Education Materials Assessment Tool, PEMAT), and accuracy. RESULTS ChatGPT-4o (o1) and ChatGPT-3.5 generated good-quality PEMs (DISCERN 52.8 and 52.7, respectively); however, quality declined from prompt A to prompt B (p = 0.001 and p = 0.013). Google Gemini produced fair-quality PEMs (DISCERN 43) but improved with prompt B (p = 0.02). All PEMs exceeded the 70% PEMAT understandability threshold but failed the 70% actionability threshold (40%). No misinformation was identified. Readability improved with prompt B; ChatGPT-4o (o1) and ChatGPT-3.5 achieved a sixth-grade level or below (FKGL 6 ± 0.6 and 6.2 ± 0.3), while Google Gemini did not (FKGL 7 ± 0.6). ChatGPT-4o (o1) outperformed Google Gemini in readability (p < 0.001) but was comparable to ChatGPT-3.5 (p = 0.846). Prompt C improved readability across all LLMs, with ChatGPT-4o (o1 Preview) showing the most significant gains (FKGL 5.8 ± 1.5; p < 0.001). CONCLUSIONS ChatGPT-4o (o1 Preview) demonstrates potential in producing accurate, good-quality, understandable PEMs, and in improving online PEMs on childhood myopia.
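For context, the Flesch-Kincaid Grade Level (FKGL) used throughout this study is a fixed formula over word, sentence, and syllable counts: FKGL = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The sketch below uses a crude vowel-group syllable heuristic rather than whichever tool the authors used, so its output should be read as approximate.

```python
# Rough sketch of the Flesch-Kincaid Grade Level (FKGL) metric:
# FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
# The syllable counter is a crude vowel-group heuristic, not the authors' tool.
import re

def count_syllables(word: str) -> int:
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

sample = ("Myopia means the eye focuses light in front of the retina. "
          "Children with myopia see near objects clearly but distant objects look blurry.")
print(round(fkgl(sample), 1))  # lower values indicate easier, lower-grade text
```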
Affiliation(s)
- Mohammad Delsoz
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Amr Hassan
- Department of Ophthalmology, Gavin Herbert Eye Institute, University of California, Irvine, CA, USA
| | - Amin Nabavi
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Amir Rahdar
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Brian Fowler
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Natalie C Kerr
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Lauren Claire Ditta
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Mary E Hoehn
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA
| | - Margaret M DeAngelis
- Department of Ophthalmology, University at Buffalo, Buffalo, NY, USA
- Research Service, VA Western New York Healthcare System, Buffalo, NY, USA
| | - Andrzej Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznan, Poland
| | - Yih-Chung Tham
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Republic of Singapore
- Centre of Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore, Republic of Singapore
| | - Siamak Yousefi
- Hamilton Eye Institute, Department of Ophthalmology, University of Tennessee Health Science Center, 930 Madison Ave., Suite 471, Memphis, TN, 38163, USA.
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
3
Huang Y, Shi R, Chen C, Zhou X, Zhou X, Hong J, Chen Z. Evaluation of large language models for providing educational information in orthokeratology care. Cont Lens Anterior Eye 2025; 48:102384. [PMID: 39939269 DOI: 10.1016/j.clae.2025.102384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2024] [Revised: 02/02/2025] [Accepted: 02/07/2025] [Indexed: 02/14/2025]
Abstract
BACKGROUND Large language models (LLMs) are gaining popularity in solving ophthalmic problems. However, their efficacy in patient education regarding orthokeratology, one of the main myopia control strategies, has yet to be determined. METHODS This cross-sectional study established a question bank of 24 orthokeratology-related questions, which were used to prompt Chinese-language responses from GPT-4, Qwen-72B, and Yi-34B. Objective evaluations were conducted using an online platform. Subjective evaluations of correctness, relevance, readability, applicability, safety, clarity, helpfulness, and satisfaction were performed by experienced ophthalmologists and parents of myopic children using a 5-point Likert scale. Overall standardized scores were also calculated. RESULTS The word count of the responses from Qwen-72B (199.42 ± 76.82) was the lowest (P < 0.001), with no significant differences in recommended age among the LLMs. GPT-4 (3.79 ± 1.03) scored lower in readability than Yi-34B (4.65 ± 0.51) and Qwen-72B (4.65 ± 0.61) (P < 0.001). No significant differences in safety, relevance, correctness, or applicability were observed across the three LLMs. Parents gave all LLMs average scores exceeding 4.7 points, with GPT-4 outperforming the others in helpfulness (P = 0.004) and satisfaction (P = 0.016). Qwen-72B's overall standardized scores surpassed those of the other two LLMs (P = 0.048). CONCLUSIONS GPT-4 and the Chinese LLM Qwen-72B produced accurate and beneficial responses to inquiries on orthokeratology. Further enhancement to bolster precision is essential, particularly within diverse linguistic contexts.
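The abstract does not spell out how the "overall standardized score" was computed; one common construction is to z-standardize each dimension across models and then average, as in the hedged sketch below. The model names come from the study, but every number and the exact formula are assumptions for illustration only.

```python
# Illustrative sketch of one way to build an "overall standardized score" from
# per-dimension Likert means (the exact formula is not stated in the abstract).
from statistics import mean, stdev

dimensions = ["correctness", "relevance", "readability", "safety"]
scores = {  # mean 5-point Likert rating per model and dimension (all invented)
    "GPT-4":    {"correctness": 4.5, "relevance": 4.6, "readability": 3.8, "safety": 4.7},
    "Qwen-72B": {"correctness": 4.4, "relevance": 4.5, "readability": 4.65, "safety": 4.7},
    "Yi-34B":   {"correctness": 4.2, "relevance": 4.4, "readability": 4.65, "safety": 4.6},
}

def zscores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

overall = {model: 0.0 for model in scores}
for dim in dimensions:
    col = [scores[m][dim] for m in scores]
    for model, z in zip(scores, zscores(col)):
        overall[model] += z / len(dimensions)   # average z-score across dimensions

for model, score in sorted(overall.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:+.2f}")
```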
Affiliation(s)
- Yangyi Huang
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China; NHC Key Laboratory of Myopia (Fudan University), Key Laboratory of Myopia, Chinese Academy of Medical Sciences, Shanghai 200031, China; Shanghai Research Center of Ophthalmology and Optometry, China; Shanghai Engineering Research Center of Laser and Autostereoscopic 3D for Vision Care, China
| | - Runhan Shi
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China
| | - Can Chen
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China; NHC Key Laboratory of Myopia (Fudan University), Key Laboratory of Myopia, Chinese Academy of Medical Sciences, Shanghai 200031, China; Shanghai Research Center of Ophthalmology and Optometry, China; Shanghai Engineering Research Center of Laser and Autostereoscopic 3D for Vision Care, China
| | - Xueyi Zhou
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China; NHC Key Laboratory of Myopia (Fudan University), Key Laboratory of Myopia, Chinese Academy of Medical Sciences, Shanghai 200031, China; Shanghai Research Center of Ophthalmology and Optometry, China; Shanghai Engineering Research Center of Laser and Autostereoscopic 3D for Vision Care, China
| | - Xingtao Zhou
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China; NHC Key Laboratory of Myopia (Fudan University), Key Laboratory of Myopia, Chinese Academy of Medical Sciences, Shanghai 200031, China; Shanghai Research Center of Ophthalmology and Optometry, China; Shanghai Engineering Research Center of Laser and Autostereoscopic 3D for Vision Care, China
| | - Jiaxu Hong
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, 200031, China; NHC Key laboratory of Myopia and Related Eye Diseases, Shanghai, 200031, China; Shanghai Key Laboratory of Rare Disease Gene Editing and Cell Therapy; Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, 200032, China; Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, 201102, China.
| | - Zhi Chen
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China; NHC Key Laboratory of Myopia (Fudan University), Key Laboratory of Myopia, Chinese Academy of Medical Sciences, Shanghai 200031, China; Shanghai Research Center of Ophthalmology and Optometry, China; Shanghai Engineering Research Center of Laser and Autostereoscopic 3D for Vision Care, China.
4
Guven Y, Ozdemir OT, Kavan MY. Performance of Artificial Intelligence Chatbots in Responding to Patient Queries Related to Traumatic Dental Injuries: A Comparative Study. Dent Traumatol 2025; 41:338-347. [PMID: 39578674 DOI: 10.1111/edt.13020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2024] [Revised: 11/04/2024] [Accepted: 11/06/2024] [Indexed: 11/24/2024]
Abstract
BACKGROUND/AIM Artificial intelligence (AI) chatbots have become increasingly prevalent in recent years as potential sources of online healthcare information for patients when making medical/dental decisions. This study assessed the readability, quality, and accuracy of responses provided by three AI chatbots to questions related to traumatic dental injuries (TDIs), either retrieved from popular question-answer sites or manually created based on the hypothetical case scenarios. MATERIALS AND METHODS A total of 59 traumatic injury queries were directed at ChatGPT 3.5, ChatGPT 4.0, and Google Gemini. Readability was evaluated using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. To assess response quality and accuracy, the DISCERN tool, Global Quality Score (GQS), and misinformation scores were used. The understandability and actionability of the responses were analyzed using the Patient Education Materials Assessment Tool for Printed Materials (PEMAT-P) tool. Statistical analysis included Kruskal-Wallis with Dunn's post hoc test for non-normal variables, and one-way ANOVA with Tukey's post hoc test for normal variables (p < 0.05). RESULTS The mean FKGL and FRE scores for ChatGPT 3.5, ChatGPT 4.0, and Google Gemini were 11.2 and 49.25, 11.8 and 46.42, and 10.1 and 51.91, respectively, indicating that the responses were difficult to read and required a college-level reading ability. ChatGPT 3.5 had the lowest DISCERN and PEMAT-P understandability scores among the chatbots (p < 0.001). ChatGPT 4.0 and Google Gemini were rated higher for quality (GQS score of 5) compared to ChatGPT 3.5 (p < 0.001). CONCLUSIONS In this study, ChatGPT 3.5, although widely used, provided some misleading and inaccurate responses to questions about TDIs. In contrast, ChatGPT 4.0 and Google Gemini generated more accurate and comprehensive answers, making them more reliable as auxiliary information sources. However, for complex issues like TDIs, no chatbot can replace a dentist for diagnosis, treatment, and follow-up care.
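As a rough illustration of the group comparison reported here, the sketch below runs a Kruskal-Wallis test over invented Flesch Reading Ease scores for the three chatbots; the Dunn post hoc step mentioned in the methods is not reproduced.

```python
# Minimal sketch of a Kruskal-Wallis comparison of readability across chatbots.
# The score vectors are invented for illustration only.
from scipy.stats import kruskal

fre_scores = {  # hypothetical Flesch Reading Ease scores per response
    "ChatGPT 3.5":   [47.1, 52.3, 49.8, 50.0, 46.9],
    "ChatGPT 4.0":   [44.2, 48.7, 45.5, 47.0, 46.6],
    "Google Gemini": [53.4, 50.9, 52.8, 51.2, 49.6],
}

stat, p = kruskal(*fre_scores.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.3f}")
# p < 0.05 would indicate at least one chatbot differs; pairwise differences
# would then need a post hoc procedure such as Dunn's test.
```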
Affiliation(s)
- Yeliz Guven
- Istanbul University, Department of Pedodontics, Faculty of Dentistry, Istanbul, Turkey
| | - Omer Tarik Ozdemir
- Istanbul University, Department of Pedodontics, Faculty of Dentistry, Istanbul, Turkey
| | - Melis Yazir Kavan
- Istanbul University, Department of Pedodontics, Faculty of Dentistry, Istanbul, Turkey
5
Ma J, Yu J, Xie A, Huang T, Liu W, Ma M, Tao Y, Zang F, Zheng Q, Zhu W, Chen Y, Ning M, Zhu Y. Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro. Sci Rep 2025; 15:17635. [PMID: 40399509 DOI: 10.1038/s41598-025-02601-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2024] [Accepted: 05/14/2025] [Indexed: 05/23/2025] Open
Abstract
Large language models (LLMs) have established a presence in providing medical services to patients and supporting clinical practice for doctors. To explore the ability of LLMs in answering clinical questions related to autoimmune diseases, this study was designed with 65 questions related to autoimmune diseases, covering five domains: concepts, report interpretation, diagnosis, prevention and treatment, and prognosis. Types of diseases include Sjögren's syndrome, systemic lupus erythematosus, rheumatoid arthritis, systemic sclerosis, and others. These questions were answered by three LLMs: ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The responses were then evaluated by 8 clinicians based on criteria including relevance, completeness, accuracy, safety, readability, and simplicity. We analyzed the scores of the three LLMs across five domains and six dimensions and compared their accuracy in answering the report interpretation section with that of two senior doctors and two junior doctors. The results showed that the performance of the three LLMs in the evaluation of autoimmune diseases significantly surpassed that of both junior and senior doctors. Notably, Claude 3.5 Sonnet excelled in providing comprehensive and accurate responses to clinical questions on autoimmune diseases, demonstrating the great potential of LLMs in assisting doctors with the diagnosis, treatment, and management of autoimmune diseases.
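The report-interpretation comparison described above ultimately contrasts accuracy proportions between a model and a group of doctors on the same question set. The sketch below uses invented counts and Fisher's exact test as an assumed (not stated) choice of test.

```python
# Hypothetical comparison of report-interpretation accuracy: an LLM versus
# junior doctors on the same set of questions (counts are invented).
from scipy.stats import fisher_exact

llm_correct, llm_total = 18, 20
doc_correct, doc_total = 12, 20

table = [
    [llm_correct, llm_total - llm_correct],   # LLM: correct, incorrect
    [doc_correct, doc_total - doc_correct],   # doctors: correct, incorrect
]
odds_ratio, p = fisher_exact(table)
print(f"LLM {llm_correct}/{llm_total} vs doctors {doc_correct}/{doc_total}: p = {p:.3f}")
```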
Affiliation(s)
- Juntao Ma
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Jie Yu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Anran Xie
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Taihong Huang
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Wenjing Liu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Mengyin Ma
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Yue Tao
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Fuyu Zang
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Qisi Zheng
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Wenbo Zhu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China
| | - Yuxin Chen
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
| | - Mingzhe Ning
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
- Yizheng Hospital of Nanjing Drum Tower Hospital Group, Yizheng 211900, Yangzhou, Jiangsu, China.
| | - Yijia Zhu
- Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, Jiangsu, China.
6
Tan D, Huang Y, Liu M, Li Z, Wu X, Huang C. Identification of Online Health Information Using Large Pretrained Language Models: Mixed Methods Study. J Med Internet Res 2025; 27:e70733. [PMID: 40367512 DOI: 10.2196/70733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2024] [Revised: 03/13/2025] [Accepted: 04/13/2025] [Indexed: 05/16/2025] Open
Abstract
BACKGROUND Online health information is widely available, but a substantial portion of it is inaccurate or misleading, including exaggerated, incomplete, or unverified claims. Such misinformation can significantly influence public health decisions and pose serious challenges to health care systems. With advances in artificial intelligence and natural language processing, pretrained large language models (LLMs) have shown promise in identifying and distinguishing misleading health information, although their effectiveness in this area remains underexplored. OBJECTIVE This study aimed to evaluate the performance of 4 mainstream LLMs (ChatGPT-3.5, ChatGPT-4, Ernie Bot, and iFLYTEK Spark) in the identification of online health information, providing empirical evidence for their practical application in this field. METHODS Web scraping was used to collect data from rumor-refuting websites, resulting in 2708 samples of online health information, including both true and false claims. The 4 LLMs' application programming interfaces were used for authenticity verification, with expert results as benchmarks. Model performance was evaluated using semantic similarity, accuracy, recall, F1-score, content analysis, and credibility. RESULTS This study found that the 4 models performed well in identifying online health information. Among them, ChatGPT-4 achieved the highest accuracy at 87.27%, followed by Ernie Bot at 87.25%, iFLYTEK Spark at 87%, and ChatGPT-3.5 at 81.82%. Furthermore, text length and semantic similarity analysis showed that Ernie Bot had the highest similarity to expert texts, whereas ChatGPT-4 showed good overall consistency in its explanations. In addition, the credibility assessment results indicated that ChatGPT-4 provided the most reliable evaluations. Further analysis suggested that the highest misjudgment probabilities with respect to the LLMs occurred within the topics of food and maternal-infant nutrition management and nutritional science and food controversies. Overall, the research suggests that LLMs have potential in online health information identification; however, their understanding of certain specialized health topics may require further improvement. CONCLUSIONS The results demonstrate that, while these models show potential in providing assistance, their performance varies significantly in terms of accuracy, semantic understanding, and cultural adaptability. The principal findings highlight the models' ability to generate accessible and context-aware explanations; however, they fall short in areas requiring specialized medical knowledge or updated data, particularly for emerging health issues and context-sensitive scenarios. Significant discrepancies were observed in the models' ability to distinguish scientifically verified knowledge from popular misconceptions and in their stability when processing complex linguistic and cultural contexts. These challenges reveal the importance of refining training methodologies to improve the models' reliability and adaptability. Future research should focus on enhancing the models' capability to manage nuanced health topics and diverse cultural and linguistic nuances, thereby facilitating their broader adoption as reliable tools for online health information identification.
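The evaluation described here reduces to comparing each model's true/false verdicts against expert labels and reporting accuracy, recall, and F1. A minimal sketch with made-up labels follows; the use of scikit-learn is an assumption, since the authors' tooling is not stated.

```python
# Scoring LLM verdicts on health claims against expert labels (labels are invented).
from sklearn.metrics import accuracy_score, recall_score, f1_score

expert = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # 1 = claim is true, 0 = claim is false
llm    = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # one model's verdicts for the same claims

print("accuracy:", accuracy_score(expert, llm))
print("recall:  ", recall_score(expert, llm))   # sensitivity to true claims
print("F1:      ", f1_score(expert, llm))
```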
Affiliation(s)
- Dongmei Tan
- College of Medical Informatics, Chongqing Medical University, Chongqing, China
| | - Yi Huang
- College of Medical Informatics, Chongqing Medical University, Chongqing, China
| | - Ming Liu
- College of Medical Informatics, Chongqing Medical University, Chongqing, China
| | - Ziyu Li
- Human Resources Department, Army Medical Center, Army Medical University (The Third Military Medical University), Chongqing, China
| | - Xiaoqian Wu
- Department of Quality Management, Army Medical Center, Army Medical University (The Third Military Medical University), Chongqing, China
| | - Cheng Huang
- College of Medical Informatics, Chongqing Medical University, Chongqing, China
7
Luo X, Tham YC, Giuffrè M, Ranisch R, Daher M, Lam K, Eriksen AV, Hsu CW, Ozaki A, Moraes FYD, Khanna S, Su KP, Begagić E, Bian Z, Chen Y, Estill J. Reporting guideline for the use of Generative Artificial intelligence tools in MEdical Research: the GAMER Statement. BMJ Evid Based Med 2025:bmjebm-2025-113825. [PMID: 40360239 DOI: 10.1136/bmjebm-2025-113825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 04/09/2025] [Indexed: 05/15/2025]
Abstract
OBJECTIVES Generative artificial intelligence (GAI) tools can enhance the quality and efficiency of medical research, but their improper use may result in plagiarism, academic fraud and unreliable findings. Transparent reporting of GAI use is essential, yet existing guidelines from journals and institutions are inconsistent, with no standardised principles. DESIGN AND SETTING International online Delphi study. PARTICIPANTS International experts in medicine and artificial intelligence. MAIN OUTCOME MEASURES The primary outcome measure is the Delphi expert panel's level of consensus on candidate items for inclusion in GAMER (Reporting guideline for the use of Generative Artificial intelligence tools in MEdical Research). RESULTS The development process included a scoping review, two Delphi rounds and virtual meetings. In total, 51 experts from 26 countries participated in the process (44 in the Delphi survey). The final checklist comprises nine reporting items: general declaration, GAI tool specifications, prompting techniques, the tool's role in the study, declaration of new GAI model(s) developed, artificial intelligence-assisted sections in the manuscript, content verification, data privacy and impact on conclusions. CONCLUSION GAMER provides a universal and standardised guideline for GAI use in medical research, ensuring transparency, integrity and quality.
Affiliation(s)
- Xufei Luo
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
| | - Yih Chung Tham
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
- Ophthalmology and Visual Science Academic Clinical Program, Duke-NUS Medical School, Singapore
| | - Mauro Giuffrè
- Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, Connecticut, USA
- Department of Medical, Surgical, and Health Sciences, University of Trieste, Trieste, Italy
| | - Robert Ranisch
- Faculty of Health Sciences Brandenburg, University of Potsdam, Potsdam, Brandenburg, Germany
| | - Mohammad Daher
- Orthopedic department, Hôtel Dieu de France, Beirut, Lebanon
| | - Kyle Lam
- Department of Surgery and Cancer, Imperial College London, London, UK
| | | | - Che-Wei Hsu
- Department of Psychological Medicine, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
- Bachelor of Social Services, College of Community Development and Personal Wellbeing, Otago Polytechnic, Dunedin, New Zealand
| | - Akihiko Ozaki
- Jyoban Hospital of Tokiwa Foundation, Iwaki, Fukushima, Japan
| | | | - Sahil Khanna
- Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA
| | - Kuan-Pin Su
- Mind-Body Interface Research Center (MBI-Lab), China Medical University Hospital, Taichung, Taiwan
- An-Nan Hospital, China Medical University, Tainan, Taiwan
| | - Emir Begagić
- Department of Neurosurgery, Cantonal Hospital Zenica, Zenica, Bosnia and Herzegovina
| | - Zhaoxiang Bian
- Vincent V.C. Woo Chinese Medicine Clinical Research Institute, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
- Chinese EQUATOR Centre, Hong Kong, China
| | - Yaolong Chen
- Research Unit of Evidence-Based Evaluation and Guidelines, Chinese Academy of Medical Sciences (2021RU017), School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu, China
- Evidence-based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- WHO Collaborating Centre for Guideline Implementation and Knowledge Translation, Lanzhou, China
| | - Janne Estill
- Evidence-based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Institute of Global Health, University of Geneva, Geneve, Switzerland
8
Li Y, Li Z, Li J, Liu L, Liu Y, Zhu B, Shi K, Lu Y, Li Y, Zeng X, Feng Y, Wang X. The actual performance of large language models in providing liver cirrhosis-related information: A comparative study. Int J Med Inform 2025; 201:105961. [PMID: 40334344 DOI: 10.1016/j.ijmedinf.2025.105961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2024] [Revised: 04/08/2025] [Accepted: 05/03/2025] [Indexed: 05/09/2025]
Abstract
OBJECTIVE With the increasing prevalence of large language models (LLMs) in the medical field, patients are increasingly turning to advanced online resources for information related to liver cirrhosis, given its long-term management demands. A comprehensive evaluation of the real-world performance of LLMs in this specialized medical area is therefore necessary. METHODS This study evaluates the performance of four mainstream LLMs (ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1) in answering 39 questions related to liver cirrhosis. Information quality, readability and accuracy were assessed using the Ensuring Quality Information for Patients (EQIP) tool, Flesch-Kincaid metrics and consensus scoring. The LLMs' simplification and self-correction abilities were also assessed. RESULTS Significant performance differences were observed among the models. Gemini scored highest in providing high-quality information. While the readability of all four LLMs was generally low, requiring a college-level reading comprehension ability, they exhibited strong capabilities in simplifying complex information. ChatGPT performed best in terms of accuracy, with a "Good" rating of 80%, higher than Claude (72%), Gemini (49%), and Llama (64%). All models received high scores for comprehensiveness. Each of the four LLMs demonstrated some degree of self-correction ability, improving the accuracy of initial answers with simple prompts. ChatGPT's and Llama's accuracy improved by 100%, Claude's by 50% and Gemini's by 67%. CONCLUSION LLMs demonstrate excellent performance in generating health information related to liver cirrhosis, yet they exhibit differences in answer quality, readability and accuracy. Future research should enhance their value in healthcare, ultimately achieving reliable, accessible and patient-centered medical information dissemination.
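The self-correction figures above can be read as the share of initially unsatisfactory answers that became acceptable after a simple follow-up prompt. The sketch below computes that share from made-up before/after ratings; it does not call any LLM API and is not the authors' procedure.

```python
# Self-correction rate: fraction of initially non-"Good" answers that were rated
# "Good" after a simple corrective prompt (all ratings are invented).
before = ["Good", "Fair", "Good", "Poor", "Fair", "Good", "Poor", "Good"]
after  = ["Good", "Good", "Good", "Good", "Fair", "Good", "Good", "Good"]

initially_wrong = [i for i, r in enumerate(before) if r != "Good"]
fixed = [i for i in initially_wrong if after[i] == "Good"]

rate = len(fixed) / len(initially_wrong)
print(f"{len(fixed)}/{len(initially_wrong)} initially unsatisfactory answers corrected ({rate:.0%})")
```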
Affiliation(s)
- Yanqiu Li
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Zhuojun Li
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
| | - Jinze Li
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, Beijing, China
| | - Long Liu
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Yao Liu
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Bingbing Zhu
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Ke Shi
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Yu Lu
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Yongqi Li
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Xuanwei Zeng
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Ying Feng
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China.
| | - Xianbo Wang
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China.
9
Gunes YC, Cesur T. The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study. J Thorac Imaging 2025; 40:e0805. [PMID: 39269227 DOI: 10.1097/rti.0000000000000805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/15/2024]
Abstract
PURPOSE To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by The Society of Thoracic Radiology. MATERIALS AND METHODS We collected 124 publicly available "Case of the Month" cases from the Society of Thoracic Radiology website between March 2012 and December 2023. Medical history and imaging findings were input into LLMs for diagnosis and differential diagnosis, while radiologists independently provided their assessments based on visual review. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests. RESULTS Among the 124 cases, Claude 3 Opus showed the highest diagnostic accuracy (70.29%), followed by ChatGPT 4/Google Gemini 1.5 Pro (59.75%), Meta Llama 3 70b (57.3%), and ChatGPT 3.5 (53.2%), outperforming the radiologists (52.4% and 41.1%) and the other LLMs (P < 0.05). The Claude 3 Opus DDxScore was significantly better than those of the other LLMs and the radiologists, except ChatGPT 3.5 (P < 0.05). All LLMs and radiologists showed greater accuracy in specific cases (P < 0.05), with no DDxScore difference for Perplexity and Google Bard based on specificity (P > 0.05). There were no significant differences between LLMs and radiologists in the diagnostic accuracy of anatomic subgroups (P > 0.05), except for Meta Llama 3 70b in the vascular cases (P = 0.040). CONCLUSIONS Claude 3 Opus outperformed the other LLMs and radiologists in text-based thoracic radiology cases. LLMs hold great promise for clinical decision systems under proper medical supervision.
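Because every model and radiologist assessed the same 124 cases, pairwise accuracy comparisons such as these are paired, and McNemar's test (listed in the methods) is the standard choice for paired correct/incorrect outcomes. The sketch below implements the exact binomial form on invented counts; whether the authors used this exact variant is an assumption.

```python
# Exact McNemar test for paired diagnostic accuracy (counts are invented):
# b = cases model A got right and model B got wrong, c = the reverse.
from scipy.stats import binom

b, c = 21, 9          # discordant case counts out of 124 shared cases
n = b + c
# two-sided exact p-value: double the tail probability of the smaller count
p = min(1.0, 2 * binom.cdf(min(b, c), n, 0.5))
print(f"discordant pairs b={b}, c={c}, exact McNemar p = {p:.3f}")
```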
Affiliation(s)
- Yasin Celal Gunes
- Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale
| | - Turay Cesur
- Department of Radiology, Mamak State Hospital, Ankara, Türkiye
10
Zhao FF, He HJ, Liang JJ, Cen LP. Reply to 'Comment on: Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3'. Eye (Lond) 2025; 39:1433. [PMID: 40044837 PMCID: PMC12043904 DOI: 10.1038/s41433-025-03737-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2025] [Revised: 02/05/2025] [Accepted: 02/19/2025] [Indexed: 05/02/2025] Open
Affiliation(s)
- Fang-Fang Zhao
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
| | - Han-Jie He
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
| | - Jia-Jian Liang
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
| | - Ling-Ping Cen
- Guangdong Provincial Key Laboratory of Medical Immunology and Molecular Diagnostics, School of Medical Technology, Guangdong Medical University, Dongguan, China.
11
Weber MT, Noll R, Marchl A, Facchinello C, Grünewaldt A, Hügel C, Musleh K, Wagner TOF, Storf H, Schaaf J. MedBot vs RealDoc: efficacy of large language modeling in physician-patient communication for rare diseases. J Am Med Inform Assoc 2025; 32:775-783. [PMID: 39998911 PMCID: PMC12012358 DOI: 10.1093/jamia/ocaf034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2024] [Revised: 02/04/2025] [Accepted: 02/12/2025] [Indexed: 02/27/2025] Open
Abstract
OBJECTIVES This study assesses the abilities of 2 large language models (LLMs), GPT-4 and BioMistral 7B, in responding to patient queries, particularly concerning rare diseases, and compares their performance with that of physicians. MATERIALS AND METHODS A total of 103 patient queries and corresponding physician answers were extracted from EXABO, a question-answering forum dedicated to rare respiratory diseases. The responses provided by physicians and generated by LLMs were ranked on a Likert scale by a panel of 4 experts based on 4 key quality criteria for health communication: correctness, comprehensibility, relevance, and empathy. RESULTS The performance of generative pretrained transformer 4 (GPT-4) was significantly better than that of the physicians and BioMistral 7B. The overall ranking considers GPT-4's responses to be mostly correct, comprehensible, relevant, and empathetic, whereas the responses provided by BioMistral 7B were only partially correct and empathetic. The responses given by physicians rank in between. The experts concur that an LLM could lighten the load for physicians, but rigorous validation is considered essential to guarantee dependability and efficacy. DISCUSSION Open-source models such as BioMistral 7B offer the advantage of privacy by running locally in health-care settings. GPT-4, on the other hand, demonstrates proficiency in communication and knowledge depth. However, challenges persist, including the management of response variability, the balancing of comprehensibility with medical accuracy, and the assurance of consistent performance across different languages. CONCLUSION The performance of GPT-4 underscores the potential of LLMs in facilitating physician-patient communication. However, it is imperative that these systems are handled with care, as erroneous responses have the potential to cause harm without the requisite validation procedures.
Affiliation(s)
- Magdalena T Weber
- Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
| | - Richard Noll
- Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
| | - Alexandra Marchl
- Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
| | | | - Achim Grünewaldt
- Department of Respiratory Medicine and Allergology, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
| | - Christian Hügel
- HELIOS Dr Horst Schmidt Kliniken Wiesbaden, Klinik für Pneumologie, Wiesbaden 65199, Germany
| | - Khader Musleh
- Department of Respiratory Medicine and Allergology, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
| | - Thomas O F Wagner
- European Reference Network for Rare Respiratory Diseases (ERN-LUNG), University Medicine Frankfurt, Frankfurt 60590, Germany
| | - Holger Storf
- Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
| | - Jannik Schaaf
- Institute of Medical Informatics, University Medicine Frankfurt, Goethe University Frankfurt, Frankfurt 60590, Germany
12
Cao Y, Lu W, Shi R, Liu F, Liu S, Xu X, Yang J, Rong G, Xin C, Zhou X, Sun X, Hong J. Performance of popular large language models in glaucoma patient education: A randomized controlled study. ADVANCES IN OPHTHALMOLOGY PRACTICE AND RESEARCH 2025; 5:88-94. [PMID: 40162329 PMCID: PMC11951182 DOI: 10.1016/j.aopr.2024.12.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/08/2024] [Revised: 11/13/2024] [Accepted: 12/01/2024] [Indexed: 04/02/2025]
Abstract
Purpose The advent of chatbots based on large language models (LLMs), such as ChatGPT, has significantly transformed knowledge acquisition. However, the application of LLMs in glaucoma patient education remains elusive. In this study, we comprehensively compared the performance of four common LLMs - Qwen, Baichuan 2, ChatGPT-4.0, and PaLM 2 - in the context of glaucoma patient education. Methods Initially, senior ophthalmologists were tasked with scoring responses generated by the LLMs to the most frequent glaucoma-related questions posed by patients. The Chinese Readability Platform was employed to assess the recommended reading age and reading difficulty score of each LLM's responses. The better-performing models were then selected, and 29 glaucoma patients participated in posing questions to the chatbots and scoring the answers within a real-world clinical setting. Attending ophthalmologists were also required to score the answers across five dimensions: correctness, completeness, readability, helpfulness, and safety. Patients, on the other hand, scored the answers based on three dimensions: satisfaction, readability, and helpfulness. Results In the first stage, Baichuan 2 and ChatGPT-4.0 outperformed the other two models, though ChatGPT-4.0 had higher recommended reading age and reading difficulty scores. In the second stage, both Baichuan 2 and ChatGPT-4.0 demonstrated exceptional performance among patients and ophthalmologists, with no statistically significant differences observed. Conclusions Our research identifies Baichuan 2 and ChatGPT-4.0 as prominent LLMs, offering viable options for glaucoma education.
Affiliation(s)
- Yuyu Cao
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases Shanghai, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
| | - Wei Lu
- Department of Ophthalmology and Vision Science, Shanghai Eye Ear Nose and Throat Hospital, Fudan University, Shanghai, China
| | - Runhan Shi
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases Shanghai, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
| | - Fuying Liu
- Department of Ophthalmology and Vision Science, Shanghai Eye Ear Nose and Throat Hospital, Fudan University, Shanghai, China
- People's Hospital of Junan, Qingdao University, Shandong, China
| | - Steven Liu
- Department of Statistics, College of Liberal Arts & Sciences, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| | - Xinwei Xu
- Faculty of Business and Economics, Hong Kong University, Hong Kong, China
| | - Jin Yang
- Department of Ophthalmology and Vision Science, Shanghai Eye Ear Nose and Throat Hospital, Fudan University, Shanghai, China
| | - Guangyu Rong
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases Shanghai, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
| | - Changchang Xin
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases Shanghai, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
| | - Xujiao Zhou
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases Shanghai, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
| | - Xinghuai Sun
- Department of Ophthalmology and Vision Science, Shanghai Eye Ear Nose and Throat Hospital, Fudan University, Shanghai, China
| | - Jiaxu Hong
- Department of Ophthalmology, Eye & ENT Hospital, State Key Laboratory of Medical Neurobiology, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases Shanghai, China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, China
13
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. J Med Internet Res 2025; 27:e64486. [PMID: 40305085 PMCID: PMC12079073 DOI: 10.2196/64486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2024] [Revised: 02/04/2025] [Accepted: 04/03/2025] [Indexed: 05/02/2025] Open
Abstract
BACKGROUND Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, due to the high degree of specialization, complexity, and specificity of medicine, which results in extremely high accuracy requirements, controversy remains about whether LLMs can be used in the medical field. More studies have evaluated the performance of various types of LLMs in medicine, but the conclusions are inconsistent. OBJECTIVE This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions to provide high-level evidence-based evidence for its future development and application in the medical field. METHODS In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were included and screened by reading published reports. The systematic review and NMA were conducted to compare the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian frequency theory methods. Indirect intercomparisons between programs were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher ranking of the corresponding LLM accuracy. RESULTS The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. In terms of accuracy for top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest rated SUCRA value for accuracy in the area of triage and classification. CONCLUSIONS Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 diagnosis and top 3 diagnosis. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous. This analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making in learning, diagnosis, and management of various clinical scenarios. TRIAL REGISTRATION PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245.
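The SUCRA values quoted above summarize, for each model, its distribution over possible ranks: a SUCRA of 1 means the model is certain to rank first and 0 means certain to rank last. A compact way to compute it from rank probabilities is sketched below with invented numbers; the authors' Bayesian NMA machinery is not reproduced.

```python
# SUCRA from a rank-probability table (probabilities are invented).
# For a treatments, SUCRA = sum of cumulative rank probabilities over ranks 1..a-1,
# divided by (a - 1); equivalently (a - mean_rank) / (a - 1).
from itertools import accumulate

rank_probs = {  # hypothetical P(model is ranked 1st, 2nd, 3rd, 4th) for one outcome
    "ChatGPT-4o":  [0.70, 0.20, 0.08, 0.02],
    "Aeyeconsult": [0.20, 0.55, 0.20, 0.05],
    "ChatGPT-4":   [0.08, 0.20, 0.52, 0.20],
    "Human":       [0.02, 0.05, 0.20, 0.73],
}

def sucra(probs):
    a = len(probs)
    cumulative = list(accumulate(probs))          # P(rank <= k)
    return sum(cumulative[: a - 1]) / (a - 1)

for model, probs in rank_probs.items():
    print(f"{model}: SUCRA = {sucra(probs):.3f}")
```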
Affiliation(s)
- Ling Wang
- Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China
- School of Pharmacy, Fujian Medical University, Fuzhou, China
| | - Jinglin Li
- School of Pharmacy, Fujian Medical University, Fuzhou, China
| | - Boyang Zhuang
- Fujian Center For Drug Evaluation and Monitoring, Fuzhou, China
| | - Shasha Huang
- School of Pharmacy, Fujian University of Traditional Chinese Medicine, Fuzhou, China
| | - Meilin Fang
- School of Pharmacy, Fujian Medical University, Fuzhou, China
| | - Cunze Wang
- School of Pharmacy, Fujian Medical University, Fuzhou, China
| | - Wen Li
- Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China
| | - Mohan Zhang
- School of Pharmacy, Fujian Medical University, Fuzhou, China
| | - Shurong Gong
- The Third Department of Critical Care Medicine, Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, Fujian, China
14
Jin K, Grzybowski A. Advancements in artificial intelligence for the diagnosis and management of anterior segment diseases. Curr Opin Ophthalmol 2025:00055735-990000000-00239. [PMID: 40279352 DOI: 10.1097/icu.0000000000001150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/27/2025]
Abstract
PURPOSE OF REVIEW The integration of artificial intelligence (AI) in the diagnosis and management of anterior segment diseases has rapidly expanded, demonstrating significant potential to revolutionize clinical practice. RECENT FINDINGS AI technologies, including machine learning and deep learning models, are increasingly applied in the detection and management of a variety of conditions, such as corneal diseases, refractive surgery, cataract, conjunctival disorders (e.g., pterygium), trachoma, and dry eye disease. By analyzing large-scale imaging data and clinical information, AI enhances diagnostic accuracy, predicts treatment outcomes, and supports personalized patient care. SUMMARY As AI models continue to evolve, particularly with the use of large models and generative AI techniques, they will further refine diagnosis and treatment planning. While challenges remain, including issues related to data diversity and model interpretability, AI's integration into ophthalmology promises to improve healthcare outcomes, making it a cornerstone of data-driven medical practice. The continued development and application of AI will undoubtedly transform the future of anterior segment ophthalmology, leading to more efficient, accurate, and individualized care.
Affiliation(s)
- Kai Jin
- Zhejiang University, Eye Center of Second Affiliated Hospital, School of Medicine
- Zhejiang Provincial Key Laboratory of Ophthalmology, Zhejiang Provincial Clinical Research Center for Eye Diseases, Zhejiang Provincial Engineering Institute on Eye Diseases, Hangzhou, Zhejiang, China
| | - Andrzej Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznan
- Department of Ophthalmology, University of Warmia and Mazury, Olsztyn, Poland
| |
|
15
|
Liu R, Liu J, Yang J, Sun Z, Yan H. Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis. BMC Musculoskelet Disord 2025; 26:369. [PMID: 40241048 PMCID: PMC12001388 DOI: 10.1186/s12891-025-08601-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/23/2025] [Accepted: 03/31/2025] [Indexed: 04/18/2025] Open
Abstract
BACKGROUND Osteoporosis is a sex-specific disease. Postmenopausal osteoporosis (PMOP) has been the focus of public health research worldwide. The purpose of this study is to evaluate the quality and readability of responses generated by artificial intelligence large-scale language models (AI-LLMs): ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced to questions related to PMOP. METHODS We collected 48 PMOP frequently asked questions (FAQs) through offline counseling and online medical community forums. We also prepared 24 specific questions about PMOP based on the Management of Postmenopausal Osteoporosis: 2022 ACOG Clinical Practice Guideline No. 2 (2022 ACOG-PMOP Guideline). In this project, the FAQs were submitted to the AI-LLMs (ChatGPT-4o mini, ChatGPT-4o, Gemini Advanced), and the responses were randomly assigned to four professional orthopedic surgeons, who independently rated their satisfaction with each response on a 5-point Likert scale. Furthermore, a Flesch Reading Ease (FRE) score was calculated for each response to assess the readability of the text generated by each LLM. RESULTS In addressing questions related to PMOP and the 2022 ACOG-PMOP guideline, ChatGPT-4o and Gemini Advanced provided more concise answers than ChatGPT-4o mini. For the overall PMOP FAQs, ChatGPT-4o had a significantly higher accuracy rate than ChatGPT-4o mini and Gemini Advanced. When answering questions related to the 2022 ACOG-PMOP guideline, ChatGPT-4o mini and ChatGPT-4o had significantly higher response accuracy than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced all have good levels of self-correction. CONCLUSIONS Our research shows that Gemini Advanced and ChatGPT-4o provide more concise and intuitive answers. ChatGPT-4o performed better when answering frequently asked questions related to PMOP. When answering questions related to the 2022 ACOG-PMOP guidelines, ChatGPT-4o mini and ChatGPT-4o responded significantly better than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced have demonstrated a strong ability to self-correct. CLINICAL TRIAL NUMBER Not applicable.
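The FRE score reported here follows the standard Flesch formula, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sketch below is a minimal Python illustration; the syllable counter is a naive vowel-group heuristic and the example sentence is invented, whereas published readability tools use dictionary-based syllable counts.

```python
import re

def count_syllables(word: str) -> int:
    # Rough vowel-group heuristic; real tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

# Hypothetical patient-education sentence; higher scores mean easier reading.
print(round(flesch_reading_ease(
    "Weight-bearing exercise and enough calcium help keep bones strong."), 1))
```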
Affiliation(s)
- Rui Liu
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
| | - Jian Liu
- College of Computer Science, Nankai University, Tianjin, 300350, China
| | - Jia Yang
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
| | - Zhiming Sun
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China.
| | - Hua Yan
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China.
| |
|
16
|
Nguyen T, Ong J, Jonnakuti V, Masalkhi M, Waisberg E, Aman S, Zaman N, Sarker P, Teo ZL, Ting DSW, Ting DSJ, Tavakkoli A, Lee AG. Artificial intelligence in the diagnosis and management of refractive errors. Eur J Ophthalmol 2025:11206721251318384. [PMID: 40223314 DOI: 10.1177/11206721251318384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/15/2025]
Abstract
Refractive error is among the leading causes of visual impairment globally. The diagnosis and management of refractive error has traditionally relied on comprehensive eye examinations by eye care professionals, but access to these specialized services has remained limited in many areas of the world. Given this, artificial intelligence (AI) has shown immense potential in transforming the diagnosis and management of refractive error. We review AI applications across various aspects of refractive error care - from axial length prediction using fundus images to risk stratification for myopia progression. AI algorithms can be trained to analyze clinical data to detect refractive error as well as predict associated risks of myopia progression. For treatments such as implantable collamer and orthokeratology lenses, AI models facilitate vault size prediction and optimal lens fitting with high accuracy. Furthermore, AI has demonstrated promise in optimizing surgical planning and outcomes for refractive procedures. Emerging digital technologies such as telehealth, smartphone applications, and virtual reality integrated with AI present novel avenues for refractive error screening. We discuss key challenges, including limited validation datasets, lack of data standardization, image quality issues, population heterogeneity, practical deployment, and ethical considerations regarding patient privacy that need to be addressed before widespread clinical implementation.
Affiliation(s)
- Tuan Nguyen
- Weill Cornell/Rockefeller/Sloan-Kettering Tri-Institutional MD-PhD Program, New York City, New York, USA
| | - Joshua Ong
- Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, Michigan, USA
| | - Venkata Jonnakuti
- Medical Scientist Training Program, Baylor College of Medicine, Houston, Texas, USA
| | | | | | - Sarah Aman
- Wilmer Eye Institute, Johns Hopkins Medicine, Baltimore, Maryland, USA
| | - Nasif Zaman
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, Nevada, USA
| | - Prithul Sarker
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, Nevada, USA
| | - Zhen Ling Teo
- Singapore National Eye Centre, Singapore Eye Research Institute, Singapore, Republic of Singapore
| | - Daniel S W Ting
- Singapore National Eye Centre, Singapore Eye Research Institute, Singapore, Republic of Singapore
| | - Darren S J Ting
- Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham, Nottingham, UK
- Academic Ophthalmology, School of Medicine, University of Nottingham, Nottingham, UK
| | - Alireza Tavakkoli
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, Nevada, USA
| | - Andrew G Lee
- The Houston Methodist Research Institute, Houston Methodist Hospital, Houston, Texas, USA
- Department of Ophthalmology, The University of Iowa Hospitals and Clinics, Iowa City, Iowa, USA
| |
|
17
|
Jiang Z, Xu Y, Lim ZW, Wang Z, Han Y, Yew SME, Pan Z, Wang Q, Wu G, Wong TY, Wang X, Wang Y, Tham YC. Comparative performance analysis of global and chinese-domain large language models for myopia. Eye (Lond) 2025:10.1038/s41433-025-03775-5. [PMID: 40223113 DOI: 10.1038/s41433-025-03775-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2024] [Revised: 03/12/2025] [Accepted: 03/20/2025] [Indexed: 04/15/2025] Open
Abstract
BACKGROUND The performance of global large language models (LLMs), trained largely on Western data, for diseases in other settings and languages is unknown. Taking myopia as an illustration, we evaluated global versus Chinese-domain LLMs in addressing Chinese-specific myopia-related questions. METHODS Global LLMs (ChatGPT-3.5, ChatGPT-4.0, Google Bard, Llama-2 7B Chat) and Chinese-domain LLMs (Huatuo-GPT, MedGPT, Ali Tongyi Qianwen, Baidu ERNIE Bot, and Baidu ERNIE 4.0) were included. All LLMs were prompted to address 39 Chinese-specific myopia queries across 10 domains. Three myopia experts evaluated the accuracy of responses on a 3-point scale. "Good"-rated responses were further evaluated for comprehensiveness and empathy on a five-point scale. "Poor"-rated responses were further prompted for self-correction and re-analysis. RESULTS The top 3 LLMs in accuracy were ChatGPT-3.5 (8.72 ± 0.75), Baidu ERNIE 4.0 (8.62 ± 0.62), and ChatGPT-4.0 (8.59 ± 0.93), with the highest proportions of "Good" responses (94.8%). The top five LLMs for comprehensiveness were ChatGPT-3.5 (4.58 ± 0.42), ChatGPT-4.0 (4.56 ± 0.50), Baidu ERNIE 4.0 (4.44 ± 0.49), MedGPT (4.34 ± 0.59), and Baidu ERNIE Bot (4.22 ± 0.74) (all p ≥ 0.059, versus ChatGPT-3.5). For empathy, the top five were ChatGPT-3.5 (4.75 ± 0.25), ChatGPT-4.0 (4.68 ± 0.32), MedGPT (4.50 ± 0.47), Baidu ERNIE Bot (4.42 ± 0.46), and Baidu ERNIE 4.0 (4.34 ± 0.64) (all p ≥ 0.052, versus ChatGPT-3.5). Baidu ERNIE 4.0 did not receive any "Poor" rating, while the other LLMs demonstrated self-correction capabilities, with improvements ranging from 50% to 100%. CONCLUSIONS Global and Chinese-domain LLMs demonstrate effective performance in addressing Chinese-specific myopia-related queries. Global LLMs showed optimal performance in Chinese-language settings despite being trained primarily on non-Chinese, largely English-language data.
Affiliation(s)
- Zehua Jiang
- Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China
| | - Yueyuan Xu
- Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China
| | - Zhi Wei Lim
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Ziyao Wang
- Department of Ophthalmology, Peking University Third Hospital, Beijing, China
| | - Yingxiang Han
- Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, School of Biological Science and Medical Engineering, Beihang University, Beijing, China
| | - Samantha Min Er Yew
- Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Zhe Pan
- Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China
| | - Qian Wang
- Beijing Tongren Eye Center, Beijing Key Laboratory of Intraocular Tumor Diagnosis and Treatment, Beijing Ophthalmology & Visual Sciences Key Lab, Medical Artificial Intelligence Research and Verification Key Laboratory of the Ministry of Industry and Information Technology, Beijing Tongren Hospital, Beijing Tongren Eye Center, Capital Medical University, Beijing, China
| | - Gangyue Wu
- Jinhua Eye Hospital, Jinhua, Zhejiang, China
| | - Tien Yin Wong
- Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China
- Beijing Key Laboratory of Intelligent Diagnostic Technology and Devices for Major Blinding Eye Diseases, Tsinghua Medicine, Tsinghua University, Beijing, China
- Singapore Eye Research Institute, Singapore National Eye Center, Eye Research Institute, Singapore, Singapore
| | - Xiaofei Wang
- Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, School of Biological Science and Medical Engineering, Beihang University, Beijing, China
| | - Yaxing Wang
- Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China.
- Beijing Key Laboratory of Intelligent Diagnostic Technology and Devices for Major Blinding Eye Diseases, Tsinghua Medicine, Tsinghua University, Beijing, China.
| | - Yih Chung Tham
- Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
- Singapore Eye Research Institute, Singapore National Eye Center, Eye Research Institute, Singapore, Singapore.
- Ophthalmology and Visual Science Academic Clinical Program, Duke-NUS Medical School, Singapore, Singapore.
| |
|
18
|
Tay JRH, Chow DY, Lim YRI, Ng E. Enhancing patient-centered information on implant dentistry through prompt engineering: a comparison of four large language models. FRONTIERS IN ORAL HEALTH 2025; 6:1566221. [PMID: 40260428 PMCID: PMC12009804 DOI: 10.3389/froh.2025.1566221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2025] [Accepted: 03/21/2025] [Indexed: 04/23/2025] Open
Abstract
Background Patients frequently seek dental information online, and generative pre-trained transformers (GPTs) may be a valuable resource. However, the quality of responses based on varying prompt designs has not been evaluated. As dental implant treatment is widely performed, this study aimed to investigate the influence of prompt design on GPT performance in answering commonly asked questions related to dental implants. Materials and methods Thirty commonly asked questions about implant dentistry - covering patient selection, associated risks, peri-implant disease symptoms, treatment for missing teeth, prevention, and prognosis - were posed to four different GPT models with different prompt designs. Responses were recorded and independently appraised by two periodontists across six quality domains. Results All models performed well, with responses classified as good quality. The contextualized model performed worse on treatment-related questions (21.5 ± 3.4, p < 0.05), but outperformed the input-output, zero-shot chain of thought, and instruction-tuned models in citing appropriate sources in its responses (4.1 ± 1.0, p < 0.001). However, its responses had less clarity and relevance than those of the other models. Conclusion GPTs can provide accurate, complete, and useful information for questions related to dental implants. While prompt design can enhance response quality, further refinement is necessary to optimize performance.
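The four prompting conditions named in this abstract (input-output, zero-shot chain of thought, instruction-tuned, and contextualized) differ mainly in how the same patient question is framed. The templates below are a minimal sketch of what such prompt designs can look like; the exact wording, the example question, and the guideline placeholder are assumptions for illustration and are not taken from the study.

```python
# Illustrative prompt templates only; the study's actual prompts are not reproduced here.
QUESTION = "Who is a suitable candidate for dental implants?"

PROMPTS = {
    # Plain input-output prompting: the question alone.
    "input_output": QUESTION,
    # Zero-shot chain of thought: ask the model to reason step by step first.
    "zero_shot_cot": QUESTION + " Let's think step by step before giving the final answer.",
    # Instruction-style prompting: explicit role and output constraints.
    "instruction": (
        "You are a periodontist. Answer the patient question below in plain "
        "language, mention key risks and benefits, and keep it under 200 words.\n"
        + QUESTION
    ),
    # Contextualized prompting: ground the answer in supplied reference material.
    "contextualized": (
        "Using only the clinical guideline excerpt provided below, answer the "
        "patient question and cite the relevant section.\n"
        "GUIDELINE: <excerpt pasted here>\n" + QUESTION
    ),
}

for name, prompt in PROMPTS.items():
    print(f"--- {name} ---\n{prompt}\n")
```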
Affiliation(s)
- John Rong Hao Tay
- Department of Restorative Dentistry, National Dental Centre Singapore, Singapore, Singapore
- Health Services and Systems Research Programme, Duke-NUS Medical School, Singapore, Singapore
| | - Dian Yi Chow
- Department of Restorative Dentistry, National Dental Centre Singapore, Singapore, Singapore
| | | | - Ethan Ng
- Department of Restorative Dentistry, National Dental Centre Singapore, Singapore, Singapore
- Centre for Oral Clinical Research, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
| |
|
19
|
Gnatzy R, Lacher M, Berger M, Boettcher M, Deffaa OJ, Kübler J, Madadi-Sanjani O, Martynov I, Mayer S, Pakarinen MP, Wagner R, Wester T, Zani A, Aubert O. Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4, and Experienced Pediatric Surgeons' Performance. Eur J Pediatr Surg 2025. [PMID: 40043742 DOI: 10.1055/a-2551-2131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 04/04/2025]
Abstract
The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet, their effect in pediatric surgery remains largely unexplored. This study aims to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, as well as answer clinical questions using complex clinical case vignettes of classic pediatric surgical diseases. We conducted the study in April 2024. We evaluated the performance of LLMs using 13 complex clinical case vignettes of pediatric surgical diseases and compared responses to a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of LLMs for completeness and accuracy. To determine differences in performance, we performed statistical analyses. ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%) but lower than pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 demonstrated superior performance in generating differential diagnoses compared to Copilot (p < 0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, the recommendations of LLMs were rated as average by pediatric surgeons. This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in handling clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and establish their usefulness in the clinical setting.
Affiliation(s)
- Richard Gnatzy
- Department of Pediatric Surgery, Leipzig University, Leipzig, Germany
| | - Martin Lacher
- Department of Pediatric Surgery, Leipzig University, Leipzig, Germany
| | - Michael Berger
- Department of Pediatric Surgery, University Hospital Essen, Essen, Germany
| | - Michael Boettcher
- Department of Pediatric Surgery, University Medical Centre Mannheim, Mannheim, Germany
| | - Oliver J Deffaa
- Department of Pediatric Surgery, Leipzig University, Leipzig, Germany
| | - Joachim Kübler
- Department of Pediatric Surgery, Hospital Bremen-Mitte, Bremen, Germany
| | - Omid Madadi-Sanjani
- Department of Pediatric Surgery, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
| | - Illya Martynov
- Centre for Pediatric Surgery, Department of Pediatric Surgery and Urology, University Hospital Giessen-Marburg, Baldingerstraße, Marburg, Germany
- Centre for Pediatric Surgery, Department of Pediatric Surgery, University Hospital Giessen-Marburg, Giessen, Germany
| | - Steffi Mayer
- Department of Pediatric Surgery, Leipzig University, Leipzig, Germany
| | - Mikko P Pakarinen
- Department of Pediatric Surgery, University of Helsinki Children's Hospital Unit of Pediatric Surgery, Helsinki, Finland
| | - Richard Wagner
- Department of Pediatric Surgery, Leipzig University, Leipzig, Germany
| | - Tomas Wester
- Department of Pediatric Surgery, Karolinska University Hospital, Stockholm, Sweden
- Department of Women's and Children's Health, Karolinska Institutet, Stockholm, Sweden
| | - Augusto Zani
- Department of Surgery, Division of Pediatric Surgery, Washington University School of Medicine, St. Louis, Missouri, United States
| | - Ophelia Aubert
- Department of Pediatric Surgery, Leipzig University, Leipzig, Germany
- Department of Pediatric Surgery, University Medical Centre Mannheim, Mannheim, Germany
| |
|
20
|
Liang Z, Wang M, Abdelatif NMN, Arunakul M, Borbon CAV, Chong KW, Chow MW, Hua Y, Oji D, Ahumada X, Siu KM, Tan KJ, Tanaka Y, Taniguchi A, Yung PSH, Ling SKK. Are Large Language Model-Based Chatbots Effective in Providing Reliable Medical Advice for Achilles Tendinopathy? An International Multispecialist Evaluation. Orthop J Sports Med 2025; 13:23259671251332596. [PMID: 40322749 PMCID: PMC12046157 DOI: 10.1177/23259671251332596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/27/2024] [Accepted: 12/02/2024] [Indexed: 05/08/2025] Open
Abstract
Background Large language model (LLM)-based chatbots have shown potential in providing health information and patient education. However, the reliability of these chatbots in offering medical advice for specific conditions like Achilles tendinopathy remains uncertain. Mixed outcomes in the field of orthopaedics highlight the need for further examination of these chatbots' reliability. Hypothesis Three leading LLM-based chatbots can provide accurate and complete responses to inquiries related to Achilles tendinopathy. Study Design Cross-sectional study. Methods Eighteen questions derived from the Dutch clinical guideline on Achilles tendinopathy were posed to 3 leading LLM-based chatbots: ChatGPT 4.0, Claude 2, and Gemini. The responses were incorporated into an online survey assessed by orthopaedic surgeons specializing in Achilles tendinopathy. Responses were evaluated using a 4-point scoring system, where 1 indicates unsatisfactory and 4 indicates excellent. The total scores for the 18 responses were aggregated for each rater and compared across the chatbots. The intraclass correlation coefficient was calculated to assess consistency among the raters' evaluations. Results Thirteen specialists from 9 diverse countries and regions participated. Analysis showed no significant difference in the mean total scores among the chatbots: ChatGPT (59.7 ± 5.5), Claude 2 (53.4 ± 9.7), and Gemini (53.6 ± 8.4). The proportions of unsatisfactory responses (score 1) were low and comparable across chatbots: 0.9% for ChatGPT 4.0, 3.4% for Claude 2, and 3.4% for Gemini. In terms of excellent responses (score 4), ChatGPT 4.0 outperformed the others, with 43.6% of the responses rated as excellent, significantly higher than Claude 2 at 27.4% and Gemini at 25.2% (P < .001 for both comparisons). Intraclass correlation coefficients indicated poor reliability for ChatGPT 4.0 (0.420) and moderate reliability for Claude 2 (0.522) and Gemini (0.575). Conclusion While LLM-based chatbots such as ChatGPT 4.0 can deliver high-quality responses to queries regarding Achilles tendinopathy, the inconsistency among specialist evaluations and the absence of standardized assessment criteria significantly challenge our ability to draw definitive conclusions. These issues underscore the need for a cautious and standardized approach when considering the integration of LLM-based chatbots into clinical settings.
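Several of these evaluations, including the specialist ratings above, report intraclass correlation coefficients for inter-rater agreement. The abstract does not state which ICC form was used, so the sketch below assumes the common ICC(2,1) (two-way random effects, absolute agreement, single rater) and uses invented ratings purely for illustration.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i, j] = score given to item i by rater j.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-items mean square
    msc = ss_cols / (k - 1)                 # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical 4-point scores from 3 raters on 5 chatbot responses.
scores = np.array([[4, 3, 4], [2, 2, 3], [3, 3, 3], [4, 4, 3], [1, 2, 2]], dtype=float)
print(round(icc_2_1(scores), 3))
```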
Affiliation(s)
- Zuru Liang
- Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, SAR, China
| | - Ming Wang
- Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, SAR, China
| | | | - Marut Arunakul
- Department of Orthopedic Surgery, Faculty of Medicine, Thammasat University, Pathumthani, Thailand
| | | | | | - Man Wai Chow
- Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, SAR, China
| | - Yinghui Hua
- Department of Sports Medicine, Huashan Hospital, Fudan University, Shanghai, China
| | - David Oji
- Foot and Ankle Surgery, Department of Orthopaedic Surgery, Stanford University School of Medicine, Redwood City, California, USA
| | | | - Kwai Ming Siu
- Department of Orthopaedics and Traumatology, Princess Margaret Hospital, Hong Kong, China
| | - Ken Jin Tan
- OrthoSports Clinic for Orthopedic Surgery and Sports Medicine, Mt. Elizabeth Novena Specialist Centre, Singapore
| | - Yasuhito Tanaka
- Department of Orthopaedic Surgery, Nara Medical University, Kashihara, Nara, Japan
| | - Akira Taniguchi
- Department of Orthopaedic Surgery, Nara Medical University, Kashihara, Nara, Japan
| | - Patrick Shu-Hang Yung
- Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, SAR, China
| | - Samuel Ka-Kin Ling
- Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, SAR, China
- Investigation performed at The Chinese University of Hong Kong, Hong Kong
| |
|
21
|
Portilla ND, Garcia-Font M, Nagendrababu V, Abbott PV, Sanchez JAG, Abella F. Accuracy and Consistency of Gemini Responses Regarding the Management of Traumatized Permanent Teeth. Dent Traumatol 2025; 41:171-177. [PMID: 39460511 DOI: 10.1111/edt.13004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2024] [Revised: 09/27/2024] [Accepted: 09/29/2024] [Indexed: 10/28/2024]
Abstract
BACKGROUND The aim of this cross-sectional observational analytical study was to assess the accuracy and consistency of responses provided by Google Gemini (GG), a free-access high-performance multimodal large language model, to questions related to the European Society of Endodontology position statement on the management of traumatized permanent teeth (MTPT). MATERIALS AND METHODS Three academic endodontists developed a set of 99 yes/no questions covering all areas of the MTPT. Nine general dentists and 22 endodontic specialists evaluated these questions for clarity and comprehension through an iterative process. Two academic dental trauma experts categorized the knowledge required to answer each question into three levels. The three academic endodontists submitted the 99 questions to GG, resulting in 297 responses, which were then assessed for accuracy and consistency. Accuracy was evaluated using the Wald binomial method, while the consistency of GG responses was assessed using the Fleiss kappa coefficient with a 95% confidence interval. A chi-squared test at the 5% significance level was used to evaluate the influence of the questions' knowledge level on accuracy and consistency. RESULTS The responses generated by Gemini showed an overall moderate accuracy of 80.81%, with no significant differences found between the responses of the academic endodontists. Overall, high consistency (95.96%) was demonstrated, with no significant differences between GG responses across the three accounts. The analysis also revealed no significant association between the questions' knowledge level and accuracy or consistency. CONCLUSIONS The results of this study could significantly impact the potential use of Gemini as a free-access source of information for clinicians in the MTPT.
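The Wald binomial method mentioned above is the simplest way to put a confidence interval around an observed accuracy proportion: p ± z·sqrt(p(1−p)/n). A minimal sketch follows; the 240-out-of-297 split is an assumption chosen to be consistent with the reported 80.81% accuracy, not a figure taken from the paper.

```python
from math import sqrt

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald binomial confidence interval: p +/- z*sqrt(p*(1-p)/n)."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical: 240 accurate responses out of 297 (about 80.8% accuracy).
print(wald_ci(240, 297))  # roughly (0.763, 0.853)
```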
Affiliation(s)
- Nicolas Dufey Portilla
- Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
- Department of Endodontics, School of Dentistry, Universidad Andres Bello, Viña del Mar, Chile
| | - Marc Garcia-Font
- Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
| | - Venkateshbabu Nagendrababu
- Department of Preventive and Restorative Dentistry, College of Dental Medicine, University of Sharjah, Sharjah, UAE
| | - Paul V Abbott
- UWA Dental School, The University of Western Australia, Perth, Western Australia, Australia
| | | | - Francesc Abella
- Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
| |
|
22
|
Deng J, Li L, Oosterhof JJ, Malliaras P, Silbernagel KG, Breda SJ, Eygendaal D, Oei EH, de Vos RJ. ChatGPT is a comprehensive education tool for patients with patellar tendinopathy, but it currently lacks accuracy and readability. Musculoskelet Sci Pract 2025; 76:103275. [PMID: 39899928 DOI: 10.1016/j.msksp.2025.103275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/19/2024] [Revised: 01/23/2025] [Accepted: 01/30/2025] [Indexed: 02/05/2025]
Abstract
BACKGROUND Generative artificial intelligence tools, such as ChatGPT, are becoming increasingly integrated into daily life, and patients might turn to this tool to seek medical information. OBJECTIVE To evaluate the performance of ChatGPT-4 in responding to patient-centered queries for patellar tendinopathy (PT). METHODS Forty-eight patient-centered queries were collected from online sources, PT patients, and experts and were then submitted to ChatGPT-4. Three board-certified experts independently assessed the accuracy and comprehensiveness of the responses. Readability was measured using the Flesch-Kincaid Grade Level (FKGL: higher scores indicate a higher grade reading level). The Patient Education Materials Assessment Tool (PEMAT) evaluated understandability and actionability (0-100%; higher scores indicate information with clearer messages and more identifiable actions). Semantic Textual Similarity (STS score, 0-1; higher scores indicate higher similarity) assessed variation in the meaning of texts over two months (including ChatGPT-4o) and for different terminologies related to PT. RESULTS Sixteen (33%) of the 48 responses were rated accurate, while 36 (75%) were rated comprehensive. Only 17% of treatment-related questions received accurate responses. Most responses were written at a college reading level (median and interquartile range [IQR] of FKGL score: 15.4 [14.4-16.6]). The median PEMAT score for understandability was 83% (IQR: 70%-92%), and for actionability it was 60% (IQR: 40%-60%). The median STS scores for the meaning of texts over two months and across terminologies were all ≥ 0.9. CONCLUSIONS ChatGPT-4 provided generally comprehensive information in response to patient-centered queries but lacked accuracy and was difficult to read for individuals below a college reading level.
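Semantic Textual Similarity scores such as those reported here are typically cosine similarities between vector representations of two texts. The abstract does not describe the study's embedding method, so the sketch below uses TF-IDF vectors from scikit-learn as a simplified stand-in; the two example answers are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of TF-IDF vectors as a crude stand-in for an
    embedding-based STS score (0 = unrelated, 1 = identical wording)."""
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Hypothetical answers to the same query collected two months apart.
may_answer = "Patellar tendinopathy is usually managed with progressive loading exercises."
july_answer = "Progressive tendon-loading exercise is the usual management for patellar tendinopathy."
print(round(similarity(may_answer, july_answer), 2))
```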
Affiliation(s)
- Jie Deng
- Department of Orthopedics and Sports Medicine, Erasmus MC University Medical Center, Rotterdam, the Netherlands; Department of Radiology and Nuclear Medicine, Erasmus MC University Medical Center, Rotterdam, the Netherlands.
| | - Lun Li
- Department of Artificial Intelligence, Bernoulli Institute, Faculty of Science and Engineering, University of Groningen, the Netherlands
| | - Jelle J Oosterhof
- Department of Sports Medicine, Haaglanden MC, Leidschendam, the Netherlands
| | - Peter Malliaras
- Department of Physiotherapy Monash University, Melbourne, Australia
| | | | - Stephan J Breda
- Department of Radiology and Nuclear Medicine, Erasmus MC University Medical Center, Rotterdam, the Netherlands
| | - Denise Eygendaal
- Department of Orthopedics and Sports Medicine, Erasmus MC University Medical Center, Rotterdam, the Netherlands
| | - Edwin Hg Oei
- Department of Radiology and Nuclear Medicine, Erasmus MC University Medical Center, Rotterdam, the Netherlands
| | - Robert-Jan de Vos
- Department of Orthopedics and Sports Medicine, Erasmus MC University Medical Center, Rotterdam, the Netherlands
| |
|
23
|
Sallam M, Alasfoor IM, Khalid SW, Al-Mulla RI, Al-Farajat A, Mijwil MM, Zahrawi R, Sallam M, Egger J, Al-Adwan AS. Chinese generative AI models (DeepSeek and Qwen) rival ChatGPT-4 in ophthalmology queries with excellent performance in Arabic and English. NARRA J 2025; 5:e2371. [PMID: 40352182 PMCID: PMC12059827 DOI: 10.52225/narra.v5i1.2371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/01/2025] [Accepted: 04/06/2025] [Indexed: 05/14/2025]
Abstract
The rapid evolution of generative artificial intelligence (genAI) has ushered in a new era of digital medical consultations, with patients turning to AI-driven tools for guidance. The emergence of Chinese-developed genAI models such as DeepSeek-R1 and Qwen-2.5 presented a challenge to the dominance of OpenAI's ChatGPT. The aim of this study was to benchmark the performance of Chinese genAI models against ChatGPT-4o and to assess disparities in performance across English and Arabic. Following the METRICS checklist for genAI evaluation, Qwen-2.5, DeepSeek-R1, and ChatGPT-4o were assessed for completeness, accuracy, and relevance using the CLEAR tool in common patient ophthalmology queries. In English, Qwen-2.5 demonstrated the highest overall performance (CLEAR score: 4.43 ± 0.28), outperforming both DeepSeek-R1 (4.3 ± 0.43) and ChatGPT-4o (4.14 ± 0.41), with p = 0.002. A similar hierarchy emerged in Arabic, with Qwen-2.5 again leading (4.40 ± 0.29), followed by DeepSeek-R1 (4.20 ± 0.49) and ChatGPT-4o (4.14 ± 0.41), with p = 0.007. Each tested genAI model exhibited near-identical performance across the two languages, with ChatGPT-4o demonstrating the most balanced linguistic capabilities (p = 0.957), while Qwen-2.5 and DeepSeek-R1 showed a marginal superiority for English. An in-depth examination of genAI performance across key CLEAR components revealed that Qwen-2.5 consistently excelled in content completeness, factual accuracy, and relevance in both English and Arabic, setting a new benchmark for genAI in medical inquiries. Despite minor linguistic disparities, all three models exhibited robust multilingual capabilities, challenging the long-held assumption that genAI is inherently biased toward English. These findings highlight the evolving nature of AI-driven medical assistance, with Chinese genAI models being able to rival or even surpass ChatGPT-4o in ophthalmology-related queries.
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
| | - Israa M. Alasfoor
- Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan
- Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan
| | - Shahad W. Khalid
- Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan
- Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan
| | - Rand I. Al-Mulla
- Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan
- Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan
| | - Amwaj Al-Farajat
- Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan
- Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan
| | - Maad M. Mijwil
- College of Administration and Economics, Al-Iraqia University, Baghdad, Iraq
- Department of Computer Techniques Engineering, Baghdad College of Economic Sciences University, Baghdad, Iraq
| | - Reem Zahrawi
- Department of Ophthalmology, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
| | - Mohammed Sallam
- Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
- Department of Management, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
- Department of Management, School of Business, International American University, Los Angeles, United States
- College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU), Dubai, United Arab Emirates
| | - Jan Egger
- Institute for Artificial Intelligence in Medicine (IKIM), Essen University Hospital (AoR), Girardetstraße, Germany
- Center for Virtual and Extended Reality in Medicine (ZvRM), Essen University Hospital (AoR), Hufelandstraße, Germany
- Cancer Research Center Cologne Essen (CCCE), University Medicine Essen (AoR), Hufelandstraße, Germany
- University of Duisburg-Essen, Faculty of Computer Science, Schützenbahn, Germany
| | - Ahmad S. Al-Adwan
- Department of Business Technology, Al-Ahliyya Amman University, Amman, Jordan
| |
|
24
|
Niriella MA, Premaratna P, Senanayake M, Kodisinghe S, Dassanayake U, Dassanayake A, Ediriweera DS, de Silva HJ. The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study. Expert Rev Gastroenterol Hepatol 2025; 19:437-442. [PMID: 39985424 DOI: 10.1080/17474124.2025.2471874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/13/2024] [Revised: 02/12/2025] [Accepted: 02/21/2025] [Indexed: 02/24/2025]
Abstract
BACKGROUND We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information. RESEARCH DESIGN AND METHODS We compared the accuracy, completeness, and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response. RESULTS The expert and AI-generated responses displayed high mean scores across all domains, with no statistically significant difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We also found no statistically significant difference in rank totals for accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] among the three raters (R1, R2, R3). CONCLUSION Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.
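The Kruskal-Wallis test used above compares ratings from more than two independent groups without assuming normality, which suits ordinal Likert-style scores. A minimal SciPy sketch follows; the three rating vectors are invented for illustration and are not the study's data.

```python
from scipy.stats import kruskal

# Hypothetical 1-5 quality ratings of answers to the same 20 FAQs
# from human experts and two LLMs (invented data).
expert  = [5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 5]
chatgpt = [5, 4, 4, 5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 5, 4, 5, 5, 4, 4, 5]
gemini  = [4, 4, 5, 5, 4, 4, 4, 4, 5, 5, 4, 5, 4, 5, 4, 5, 4, 4, 4, 5]

h_stat, p_value = kruskal(expert, chatgpt, gemini)  # df = number of groups - 1 = 2
print(f"H(2) = {h_stat:.3f}, p = {p_value:.3f}")
```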
|
25
|
Kang D, Wu H, Yuan L, Shen W, Feng J, Zhan J, Grzybowski A, Sun W, Jin K. Evaluating the Efficacy of Large Language Models in Guiding Treatment Decisions for Pediatric Refractive Error. Ophthalmol Ther 2025; 14:705-716. [PMID: 39985747 PMCID: PMC11920547 DOI: 10.1007/s40123-025-01105-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2024] [Accepted: 01/29/2025] [Indexed: 02/24/2025] Open
Abstract
INTRODUCTION Effective management of pediatric myopia, which includes treatments like corrective lenses and low-dose atropine, requires accurate clinical decisions. However, the complexity of pediatric refractive data, such as variations in visual acuity, axial length, and patient-specific factors, poses challenges to determining optimal treatment. This study aims to evaluate the performance of three large language models in analyzing these refractive data. METHODS A dataset of 100 pediatric refractive records, including parameters like visual acuity and axial length, was analyzed using ChatGPT-3.5, ChatGPT-4o, and Wenxin Yiyan. Each model was tasked with determining whether intervention was needed and subsequently recommending a treatment (eyeglasses, orthokeratology lens, or low-dose atropine). The recommendations were compared to professional optometrists' consensus, rated on a 1-5 Global Quality Score (GQS) scale, and evaluated for clinical safety utilizing a three-tier accuracy assessment. RESULTS ChatGPT-4o outperformed both ChatGPT-3.5 and Wenxin Yiyan in determining intervention needs, with an accuracy of 90%, significantly higher than Wenxin Yiyan (p < 0.05). It also achieved the highest GQS of 4.4 ± 0.55, surpassing the other models (p < 0.001), with 85% of responses rated as "good", ahead of ChatGPT-3.5 (82%) and Wenxin Yiyan (74%). ChatGPT-4o made only eight errors in recommending interventions, fewer than ChatGPT-3.5 (12) and Wenxin Yiyan (15). Additionally, it performed better with incomplete or abnormal data, maintaining higher quality scores. CONCLUSION ChatGPT-4o showed better accuracy and clinical safety, making it a promising tool for decision support in pediatric ophthalmology, although expert oversight is still necessary.
Affiliation(s)
- Daohuan Kang
- Department of Ophthalmology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
| | - Hongkang Wu
- Zhejiang University, Eye Center of Second Affiliated Hospital, School of Medicine, Hangzhou, Zhejiang, China
- Zhejiang Provincial Key Laboratory of Ophthalmology, Zhejiang Provincial Clinical Research Center for Eye Diseases, Zhejiang Provincial Engineering Institute on Eye Diseases, Hangzhou, Zhejiang, China
| | - Lu Yuan
- Department of Ophthalmology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
| | - Wenyue Shen
- Zhejiang University, Eye Center of Second Affiliated Hospital, School of Medicine, Hangzhou, Zhejiang, China
- Zhejiang Provincial Key Laboratory of Ophthalmology, Zhejiang Provincial Clinical Research Center for Eye Diseases, Zhejiang Provincial Engineering Institute on Eye Diseases, Hangzhou, Zhejiang, China
| | - Jia Feng
- Department of Ophthalmology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
| | - Jiao Zhan
- Department of Ophthalmology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China
| | - Andrzej Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznan, Poland
| | - Wen Sun
- Department of Ophthalmology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, China.
| | - Kai Jin
- Zhejiang University, Eye Center of Second Affiliated Hospital, School of Medicine, Hangzhou, Zhejiang, China.
- Zhejiang Provincial Key Laboratory of Ophthalmology, Zhejiang Provincial Clinical Research Center for Eye Diseases, Zhejiang Provincial Engineering Institute on Eye Diseases, Hangzhou, Zhejiang, China.
| |
|
26
|
Li J, Chang C, Li Y, Cui S, Yuan F, Li Z, Wang X, Li K, Feng Y, Wang Z, Wei Z, Jian F. Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance. J Med Syst 2025; 49:39. [PMID: 40128385 DOI: 10.1007/s10916-025-02170-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2024] [Accepted: 03/16/2025] [Indexed: 03/26/2025]
Abstract
With the increasing application of large language models (LLMs) in the medical field, their potential in patient education and clinical decision support is becoming increasingly prominent. Given the complex pathogenesis, diverse treatment options, and lengthy rehabilitation periods of spinal cord injury (SCI), patients are increasingly turning to advanced online resources to obtain relevant medical information. This study analyzed responses from four LLMs-ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1-to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was independently scored by three senior spine surgeons using consensus scoring. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring college-level reading comprehension, they were all able to effectively simplify complex content. Notably, ChatGPT led in accuracy, achieving a significantly higher proportion of "Good" ratings (83.8%) than Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities. After being prompted for revision, the accuracy of ChatGPT's and Claude's responses improved by 100% and 50%, respectively; both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.
Affiliation(s)
- Jinze Li
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China
| | - Chao Chang
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China
| | - Yanqiu Li
- Center for Integrative Medicine, Beijing Ditan Hospital, Capital Medical University, Beijing, China
| | - Shengyu Cui
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China
| | - Fan Yuan
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China
| | - Zhuojun Li
- School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
| | - Xinyu Wang
- Baylor College of Medicine, Houston, TX, USA
| | - Kang Li
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China
| | - Yuxin Feng
- Capital Medical University, Beijing, China
| | - Zuowei Wang
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China.
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China.
| | - Zhijian Wei
- Department of Orthopaedics, Qilu Hospital of Shandong University, Shandong University, No. 107 Wenhua West Road, Lixia District, 250012, Jinan, China.
| | - Fengzeng Jian
- Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China.
- Spine Center, China International Neuroscience Institute (CHINA-INI), Beijing, China.
| |
|
27
|
Sridhar GR, Gumpeny L. Prospects and perils of ChatGPT in diabetes. World J Diabetes 2025; 16:98408. [PMID: 40093292 PMCID: PMC11885976 DOI: 10.4239/wjd.v16.i3.98408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/25/2024] [Revised: 11/05/2024] [Accepted: 12/03/2024] [Indexed: 01/21/2025] Open
Abstract
ChatGPT, a popular large language model developed by OpenAI, has the potential to transform the management of diabetes mellitus. It is a conversational artificial intelligence model trained on extensive datasets, although not specifically health-related. The development and core components of ChatGPT include neural networks and machine learning. Since the current model is not yet developed on diabetes-related datasets, it has limitations such as the risk of inaccuracies and the need for human supervision. Nevertheless, it has the potential to aid in patient engagement, medical education, and clinical decision support. In diabetes management, it can contribute to patient education, personalized dietary guidelines, and providing emotional support. Specifically, it is being tested in clinical scenarios such as assessment of obesity, screening for diabetic retinopathy, and provision of guidelines for the management of diabetic ketoacidosis. Ethical and legal considerations are essential before ChatGPT can be integrated into healthcare. Potential concerns relate to data privacy, accuracy of responses, and maintenance of the patient-doctor relationship. Ultimately, while ChatGPT and large language models hold immense potential to revolutionize diabetes care, one needs to weigh their limitations, ethical implications, and the need for human supervision. The integration promises a future of proactive, personalized, and patient-centric care in diabetes management.
Affiliation(s)
- Gumpeny R Sridhar
- Department of Endocrinology and Diabetes, Endocrine and Diabetes Centre, Visakhapatnam 530002, Andhra Pradesh, India
| | - Lakshmi Gumpeny
- Department of Internal Medicine, Gayatri Vidya Parishad Institute of Healthcare & Medical Technology, Visakhapatnam 530048, Andhra Pradesh, India
| |
|
28
|
Ermis S, Özal E, Karapapak M, Kumantaş E, Özal SA. Assessing the Responses of Large Language Models (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Retinopathy of Prematurity: A Study on Readability and Appropriateness. J Pediatr Ophthalmol Strabismus 2025; 62:84-95. [PMID: 39465590 DOI: 10.3928/01913913-20240911-05] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 10/29/2024]
Abstract
PURPOSE To assess the appropriateness and readability of responses provided by four large language models (LLMs) (ChatGPT-4, Claude 3, Gemini, and Microsoft Copilot) to parents' queries pertaining to retinopathy of prematurity (ROP). METHODS A total of 60 frequently asked questions were collated and categorized into six distinct sections. The responses generated by the LLMs were evaluated by three experienced ROP specialists to determine their appropriateness and comprehensiveness. Additionally, the readability of the responses was assessed using a range of metrics, including the Flesch-Kincaid Grade Level (FKGL), Gunning Fog (GF) Index, Coleman-Liau (CL) Index, Simple Measure of Gobbledygook (SMOG) Index, and Flesch Reading Ease (FRE) score. RESULTS ChatGPT-4 demonstrated the highest level of appropriateness (100%) and performed exceptionally well in the Likert analysis, scoring 5 points on 96% of questions. The CL Index and FRE scores identified Gemini as the most readable LLM, whereas the GF Index and SMOG Index rated Microsoft Copilot as the most readable. Nevertheless, ChatGPT-4 exhibited the most intricate text structure, with scores of 18.56 on the GF Index, 18.56 on the CL Index, 17.2 on the SMOG Index, and 9.45 on the FRE score. This suggests that its responses demand college-level comprehension. CONCLUSIONS ChatGPT-4 demonstrated higher performance than the other LLMs in responding to questions related to ROP; however, its texts were more complex. In terms of readability, Gemini and Microsoft Copilot were more successful. [J Pediatr Ophthalmol Strabismus. 2025;62(2):84-95.].
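The FRE formula was sketched earlier; the remaining grade-level indices used here (FKGL, Gunning Fog, SMOG, Coleman-Liau) can all be computed from the same surface statistics of a text. The sketch below applies the standard published formulas with a naive syllable heuristic; the sample passage is invented, and production readability tools count syllables more carefully.

```python
import re
from math import sqrt

def _syllables(word: str) -> int:
    # Rough vowel-group heuristic; real tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def grade_level_indices(text: str) -> dict:
    sentences = max(1, len([s for s in re.split(r"[.!?]+", text) if s.strip()]))
    words = re.findall(r"[A-Za-z']+", text)
    n = len(words)
    letters = sum(len(w) for w in words)
    syllables = sum(_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if _syllables(w) >= 3)  # "complex" words

    return {
        "FKGL": 0.39 * n / sentences + 11.8 * syllables / n - 15.59,
        "Gunning Fog": 0.4 * (n / sentences + 100 * polysyllables / n),
        "SMOG": 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291,
        "Coleman-Liau": 0.0588 * (100 * letters / n) - 0.296 * (100 * sentences / n) - 15.8,
    }

sample = ("Retinopathy of prematurity is an eye disease that can affect premature "
          "babies. Regular screening examinations allow timely treatment.")
print({k: round(v, 1) for k, v in grade_level_indices(sample).items()})
```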
|
29
|
Liu W, Wei H, Xiang L, Liu Y, Wang C, Hua Z. Bridging the Gap in Neonatal Care: Evaluating AI Chatbots for Chronic Neonatal Lung Disease and Home Oxygen Therapy Management. Pediatr Pulmonol 2025; 60:e71020. [PMID: 40042139 DOI: 10.1002/ppul.71020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 01/27/2025] [Accepted: 02/19/2025] [Indexed: 05/12/2025]
Abstract
OBJECTIVE To evaluate the accuracy and comprehensiveness of eight free, publicly available large language model (LLM) chatbots in addressing common questions related to chronic neonatal lung disease (CNLD) and home oxygen therapy (HOT). STUDY DESIGN Twenty CNLD- and HOT-related questions were curated across nine domains. Responses from ChatGPT-3.5, Google Bard, Bing Chat, Claude 3.5 Sonnet, ERNIE Bot 3.5, and GLM-4 were generated and evaluated by three experienced neonatologists using Likert scales for accuracy and comprehensiveness. Updated models (ChatGPT-4o mini and Gemini 2.0 Flash Experimental) were incorporated to assess rapid technological advancement. Statistical analyses included ANOVA, Kruskal-Wallis tests, and intraclass correlation coefficients. RESULTS Bing Chat and Claude 3.5 Sonnet demonstrated superior performance, with the highest mean accuracy scores (5.78 ± 0.48 and 5.75 ± 0.54, respectively) and competence scores (2.65 ± 0.58 and 2.80 ± 0.41, respectively). In subsequent testing, Gemini 2.0 Flash Experimental and ChatGPT-4o mini achieved comparably high performance. Performance varied across domains, with all models excelling in "equipment and safety protocols" and "caregiver support." ERNIE Bot 3.5 and GLM-4 showed self-correction capabilities when prompted. CONCLUSIONS LLMs show promise as a source of accurate CNLD/HOT information. However, performance variability and the risk of misinformation necessitate expert oversight and continued refinement before widespread clinical implementation.
Affiliation(s)
- Weiqin Liu
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
- Changdu People's Hospital of Xizang, Xizang, China
| | - Hong Wei
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Lingling Xiang
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Yin Liu
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
- Special Care Nursery, Port Moresby General Hospital, Port Moresby, Papua New Guinea
| | - Chunyi Wang
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Ziyu Hua
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| |
Collapse
|
30
|
Oeding JF, Lu AZ, Mazzucco M, Fu MC, Taylor SA, Dines DM, Warren RF, Gulotta LV, Dines JS, Kunze KN. ChatGPT-4 Performs Clinical Information Retrieval Tasks Using Consistently More Trustworthy Resources Than Does Google Search for Queries Concerning the Latarjet Procedure. Arthroscopy 2025; 41:588-597. [PMID: 38936557 DOI: 10.1016/j.arthro.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/13/2023] [Revised: 05/15/2024] [Accepted: 05/16/2024] [Indexed: 06/29/2024]
Abstract
PURPOSE To assess the ability of ChatGPT-4, an automated chatbot powered by artificial intelligence, to answer common patient questions concerning the Latarjet procedure for patients with anterior shoulder instability and compare this performance with Google Search Engine. METHODS Using previously validated methods, a Google search was first performed using the query "Latarjet." Subsequently, the top 10 frequently asked questions (FAQs) and associated sources were extracted. ChatGPT-4 was then prompted to provide the top 10 FAQs and answers concerning the procedure. This process was repeated to identify additional FAQs requiring discrete-numeric answers to allow for a comparison between ChatGPT-4 and Google. Discrete, numeric answers were subsequently assessed for accuracy on the basis of the clinical judgment of 2 fellowship-trained sports medicine surgeons who were blinded to search platform. RESULTS Mean (± standard deviation) accuracy of numeric-based answers was 2.9 ± 0.9 for ChatGPT-4 versus 2.5 ± 1.4 for Google (P = .65). ChatGPT-4 derived its answers exclusively from academic sources, a significant difference from Google Search Engine (P = .003), which used academic sources for only 30% of answers, alongside websites from individual surgeons (50%) and larger medical practices (20%). For general FAQs, 40% were identical between ChatGPT-4 and Google Search Engine. In terms of sources used to answer these questions, ChatGPT-4 again used 100% academic resources, whereas Google Search Engine used 60% academic resources, 20% surgeon personal websites, and 20% medical practices (P = .087). CONCLUSIONS ChatGPT-4 demonstrated the ability to provide accurate and reliable information about the Latarjet procedure in response to patient queries, using multiple academic sources in all cases. This was in contrast to Google Search Engine, which more frequently used single-surgeon and large medical practice websites. Despite differences in the resources accessed to perform information retrieval tasks, the clinical relevance and accuracy of information provided did not significantly differ between ChatGPT-4 and Google Search Engine. CLINICAL RELEVANCE Commercially available large language models (LLMs), such as ChatGPT-4, can perform diverse information retrieval tasks on-demand. An important medical information retrieval application for LLMs consists of the ability to provide comprehensive, relevant, and accurate information for various use cases, such as investigating a recently diagnosed medical condition or procedure. Understanding the performance and abilities of LLMs for these use cases has important implications for deployment within health care settings.
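The contrast in source categories reported above can be illustrated with a small contingency-table sketch. The counts below are reconstructed from the percentages quoted in the abstract (10 sources per platform), and the chi-square test is an illustrative choice; the abstract does not state which test produced P = .003.

```python
# Minimal sketch: source categories cited by ChatGPT-4 vs Google Search for the
# numeric-answer FAQs. Counts are reconstructed from the abstract's percentages
# (10 sources each); the study's actual statistical test is not specified here.
import numpy as np
from scipy.stats import chi2_contingency

#                  academic  surgeon site  medical practice
table = np.array([[10,        0,            0],    # ChatGPT-4
                  [ 3,        5,            2]])   # Google Search

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")  # small counts: interpret cautiously
```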
Collapse
Affiliation(s)
- Jacob F Oeding
- School of Medicine, Mayo Clinic Alix School of Medicine, Rochester, Minnesota, U.S.A
| | - Amy Z Lu
- Weill Cornell College of Medicine, New York, New York, U.S.A
| | | | - Michael C Fu
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
| | - Samuel A Taylor
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
| | - David M Dines
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
| | - Russell F Warren
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
| | - Lawrence V Gulotta
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
| | - Joshua S Dines
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
| | - Kyle N Kunze
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.; Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A..
| |
Collapse
|
31
|
García-Rudolph A, Sanchez-Pinsach D, Caridad Fernandez M, Cunyat S, Opisso E, Hernandez-Pena E. How Chatbots Respond to NCLEX-RN Practice Questions: Assessment of Google Gemini, GPT-3.5, and GPT-4. Nurs Educ Perspect 2025; 46:E18-E20. [PMID: 39692545 DOI: 10.1097/01.nep.0000000000001364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2024]
Abstract
ABSTRACT ChatGPT often "hallucinates" or misleads, underscoring the need for formal validation at the professional level for reliable use in nursing education. We evaluated two free chatbots (Google Gemini and GPT-3.5) and a commercial version (GPT-4) on 250 standardized questions from a simulated nursing licensure exam, which closely matches the content and complexity of the actual exam. Gemini achieved 73.2 percent (183/250), GPT-3.5 achieved 72 percent (180/250), and GPT-4 reached a notably higher performance with 92.4 percent (231/250). GPT-4 exhibited its highest error rate (13.3%) in the psychosocial integrity category.
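The headline results are simple proportions over 250 questions. The sketch below recomputes them and adds a two-proportion z-test for the GPT-4 vs Gemini gap; the z-test is an illustrative choice, not an analysis reported in the study, and it assumes statsmodels is available.

```python
# Minimal sketch: accuracy on the 250 practice questions and an illustrative
# two-proportion z-test for GPT-4 vs Gemini (counts taken from the abstract).
from statsmodels.stats.proportion import proportions_ztest

correct = [231, 183]   # GPT-4, Gemini
total = [250, 250]

z, p = proportions_ztest(correct, total)
print(f"GPT-4: {correct[0] / total[0]:.1%}  Gemini: {correct[1] / total[1]:.1%}")
print(f"two-proportion z-test: z = {z:.2f}, p = {p:.4f}")
```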
Collapse
Affiliation(s)
- Alejandro García-Rudolph
- About the Authors Alejandro García-Rudolph, PhD; David Sanchez-Pinsach, PhD; Mira Caridad Fernandez, MSc; Sandra Cunyat, MSc; Eloy Opisso, PhD; and Elena Hernandez-Pena, MSc, are faculty, Institut Guttmann Hospital de Neurorehabilitació, Barcelona, Spain. The authors are grateful to Olga Araujo of the Institut Guttmann-Documentation Office for her support in accessing the literature. For more information, contact Dr. Alejandro García-Rudolph at
| | | | | | | | | | | |
Collapse
|
32
|
Zeljkovic I, Novak A, Lisicic A, Jordan A, Serman A, Jurin I, Pavlovic N, Manola S. Beyond Text: The Impact of Clinical Context on GPT-4's 12-Lead Electrocardiogram Interpretation Accuracy. Can J Cardiol 2025:S0828-282X(25)00132-1. [PMID: 39971004 DOI: 10.1016/j.cjca.2025.01.036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2024] [Revised: 01/10/2025] [Accepted: 01/14/2025] [Indexed: 02/21/2025] Open
Abstract
BACKGROUND Artificial intelligence (AI) and large language models (LLMs), such as OpenAI's GPT-4, are increasingly being explored for medical applications. Recently, GPT-4 gained image processing capabilities, enabling it to handle tasks such as image captioning, visual question answering, and potentially interpreting medical data. Despite promising potential in diagnostics, the effectiveness of GPT-4 in interpreting complex 12-lead electrocardiograms (ECGs) remains to be assessed. METHODS This study utilized GPT-4 to interpret 150 12-lead ECGs from the Cardiology Research Dubrava (CaRD) registry, spanning a wide range of cardiac pathologies. The ECGs were classified into 4 categories for analysis: arrhythmias, conduction system abnormalities, acute coronary syndrome, and other. Two experiments were conducted: one where GPT-4 interpreted ECGs without clinical context, and another with added clinical scenarios. A panel of experienced cardiologists evaluated the accuracy of GPT-4's interpretations. RESULTS In this cross-sectional observational study, GPT-4 demonstrated a correct interpretation rate of 19% without clinical context and a significantly improved rate of 45% with context (P < 0.001). The addition of clinical scenarios significantly enhanced interpretative accuracy, particularly in the acute coronary syndrome category (10% vs 70%; P < 0.001). The "other" category showed no impact (51% vs 59%; P = 0.640), and trends toward significance were observed in the arrhythmias (9.7% vs 32%; P = 0.059) and conduction system abnormalities (4.8% vs 19%; P = 0.088) categories when given clinical context. CONCLUSIONS Although GPT-4 shows potential in aiding 12-lead ECG interpretation, its effectiveness varies significantly with clinical context. The study suggests that GPT-4 alone in its current form may not provide accurate 12-lead ECG interpretation.
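The improvement from adding clinical context can be sketched as a comparison of two proportions over the 150 ECGs. The chi-square test below is shown only for illustration; because the same ECGs were interpreted twice, the authors may well have used a paired test instead, and the counts are approximate reconstructions from the reported percentages.

```python
# Minimal sketch: correct interpretations without vs with clinical context
# (about 19% and 45% of 150 ECGs, per the abstract). Counts are approximate
# reconstructions, and the paired structure of the design is ignored here.
import numpy as np
from scipy.stats import chi2_contingency

n = 150
correct_no_ctx, correct_ctx = 29, 68   # ~19% and ~45% of 150

table = np.array([[correct_no_ctx, n - correct_no_ctx],
                  [correct_ctx,    n - correct_ctx]])
chi2, p, _, _ = chi2_contingency(table)
print(f"without context: {correct_no_ctx}/{n}, with context: {correct_ctx}/{n}")
print(f"chi-square p = {p:.2e}")
```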
Collapse
Affiliation(s)
- Ivan Zeljkovic
- Dubrava University Hospital, Zagreb, Croatia; Catholic University of Croatia, Zagreb, Croatia. https://twitter.com/i_zeljkovic
| | - Andrej Novak
- Dubrava University Hospital, Zagreb, Croatia; Department of Mathematics, University of Vienna, Vienna, Austria; Luxembourg School of Business, Luxembourg.
| | | | - Ana Jordan
- Dubrava University Hospital, Zagreb, Croatia
| | - Ana Serman
- School of Medicine, University of Zagreb, Zagreb, Croatia
| | - Ivana Jurin
- Dubrava University Hospital, Zagreb, Croatia
| | | | - Sime Manola
- Dubrava University Hospital, Zagreb, Croatia
| |
Collapse
|
33
|
Fattah FH, Salih AM, Salih AM, Asaad SK, Ghafour AK, Bapir R, Abdalla BA, Othman S, Ahmed SM, Hasan SJ, Mahmood YM, Kakamad FH. Comparative analysis of ChatGPT and Gemini (Bard) in medical inquiry: a scoping review. Front Digit Health 2025; 7:1482712. [PMID: 39963119 PMCID: PMC11830737 DOI: 10.3389/fdgth.2025.1482712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2024] [Accepted: 01/21/2025] [Indexed: 02/20/2025] Open
Abstract
Introduction Artificial intelligence and machine learning are popular interconnected technologies. AI chatbots like ChatGPT and Gemini show considerable promise in medical inquiries. This scoping review aims to assess the accuracy and response length (in characters) of ChatGPT and Gemini in medical applications. Methods The eligible databases were searched to find studies published in English from January 1 to October 20, 2023. The inclusion criteria consisted of studies that focused on using AI in medicine and assessed outcomes based on the accuracy and character count (length) of ChatGPT and Gemini. Data collected from the studies included the first author's name, the country where the study was conducted, the type of study design, publication year, sample size, medical specialty, and the accuracy and response length. Results The initial search identified 64 papers, with 11 meeting the inclusion criteria, involving 1,177 samples. ChatGPT showed higher accuracy in radiology (87.43% vs. Gemini's 71%) and shorter responses (907 vs. 1,428 characters). Similar trends were noted in other specialties. However, Gemini outperformed ChatGPT in emergency scenarios (87% vs. 77%) and in renal diets with low potassium and high phosphorus (79% vs. 60% and 100% vs. 77%). Statistical analysis confirms that ChatGPT has greater accuracy and shorter responses than Gemini in medical studies, with a p-value of <.001 for both metrics. Conclusion This scoping review suggests that ChatGPT may demonstrate higher accuracy and provide shorter responses than Gemini in medical studies.
Collapse
Affiliation(s)
- Fattah H. Fattah
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- College of Medicine, University of Sulaimani, Sulaymaniyah, Iraq
| | - Abdulwahid M. Salih
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- College of Medicine, University of Sulaimani, Sulaymaniyah, Iraq
| | - Ameer M. Salih
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- Civil Engineering Department, College of Engineering, University of Sulaimani, Sulaymaniyah, Iraq
| | - Saywan K. Asaad
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- College of Medicine, University of Sulaimani, Sulaymaniyah, Iraq
| | | | - Rawa Bapir
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- Department of Urology, Sulaimani Surgical Teaching Hospital, Sulaymaniyah, Iraq
- Kscien Organization for Scientific Research (Middle East Office), Sulaymaniyah, Iraq
| | - Berun A. Abdalla
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- Kscien Organization for Scientific Research (Middle East Office), Sulaymaniyah, Iraq
| | - Snur Othman
- Kscien Organization for Scientific Research (Middle East Office), Sulaymaniyah, Iraq
| | - Sasan M. Ahmed
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- Kscien Organization for Scientific Research (Middle East Office), Sulaymaniyah, Iraq
| | - Sabah Jalal Hasan
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
| | - Yousif M. Mahmood
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
| | - Fahmi H. Kakamad
- Scientific Affairs Department, Smart Health Tower, Sulaymaniyah, Iraq
- College of Medicine, University of Sulaimani, Sulaymaniyah, Iraq
- Kscien Organization for Scientific Research (Middle East Office), Sulaymaniyah, Iraq
| |
Collapse
|
34
|
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/26/2024] [Indexed: 02/06/2025] Open
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Collapse
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
| | - Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
| | | | | | - Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
35
|
Ding H, Xia W, Zhou Y, Wei L, Feng Y, Wang Z, Song X, Li R, Mao Q, Chen B, Wang H, Huang X, Zhu B, Jiang D, Sun J, Dong G, Jiang F. Evaluation and practical application of prompt-driven ChatGPTs for EMR generation. NPJ Digit Med 2025; 8:77. [PMID: 39894840 PMCID: PMC11788423 DOI: 10.1038/s41746-025-01472-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2024] [Accepted: 01/19/2025] [Indexed: 02/04/2025] Open
Abstract
This study investigates the application of prompt engineering to optimize prompt-driven ChatGPT for generating electronic medical records (EMRs) during lung nodule screening. We assessed the performance of ChatGPT in generating EMRs from patient-provider verbal consultations and integrated this approach into practical tools, such as WeChat mini-programs, accessible to patients before hospital visits. The findings highlight ChatGPT's potential to enhance workflow efficiency and improve diagnostic processes in clinical settings.
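A prompt-driven EMR-drafting step of the kind described can be sketched with the OpenAI Python SDK. The system prompt, model name, and EMR fields below are invented placeholders rather than the authors' prompts or deployment, and the call assumes an OPENAI_API_KEY is configured in the environment.

```python
# Minimal sketch: drafting a structured EMR section from a consultation transcript
# with a prompt-engineered request. Prompt wording, model name, and field list are
# placeholders for illustration only, not the study's actual configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You draft structured electronic medical record entries for lung nodule "
    "screening visits. From the transcript, fill in: chief complaint, history of "
    "present illness, smoking history, and relevant imaging findings. "
    "Do not invent details that are not in the transcript."
)

transcript = (
    "Patient: A 6 mm nodule showed up on my chest CT last month. I smoked for "
    "ten years but quit five years ago. No cough or chest pain."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ],
    temperature=0.2,  # keep the draft conservative
)
print(response.choices[0].message.content)
```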
Collapse
Affiliation(s)
- Hanlin Ding
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China
| | - Wenjie Xia
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
| | - Yujia Zhou
- The Second Clinical Medical School of Nanjing Medical University, Nanjing, China
| | - Lei Wei
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China
- Department of Cardiothoracic Surgery, Jinling Hospital, Nanjing University School of Medicine, Nanjing, China
| | - Yipeng Feng
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China
| | - Zi Wang
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China
| | - Xuming Song
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China
| | - Rutao Li
- Department of Thoracic Surgery, Dushu Lake Hospital Affiliated to Soochow University, Suzhou, China
| | - Qixing Mao
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
| | - Bing Chen
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
| | - Hui Wang
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China
| | - Xing Huang
- Pathological Department of Jiangsu Cancer Hospital, Nanjing, P. R. China
| | - Bin Zhu
- Hospital Development Management Office, Nanjing Medical University, Nanjing, China
| | - Dongyu Jiang
- Department of Orthopedics, Wuxi People's Hospital Affiliated to Nanjing Medical University, Wuxi, China
| | - Jingyu Sun
- Department of Cardiology, First Affiliated Hospital of Nanjing Medical University, Jiangsu Province Hospital, Nanjing, China
| | - Gaochao Dong
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China.
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China.
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China.
| | - Feng Jiang
- Department of Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital & Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, 21009, Nanjing, China.
- Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Cancer Institute of Jiangsu Province, Nanjing, China.
- The Fourth Clinical College of Nanjing Medical University, Nanjing, China.
| |
Collapse
|
36
|
Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, Berland TL, Lohr J, Moore C, Maldonado TS. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular 2025; 33:229-237. [PMID: 38500300 DOI: 10.1177/17085381241240550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
OBJECTIVES Generative artificial intelligence (AI) has emerged as a promising tool to engage with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes. METHODS OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on their accuracy (rated 1-4 from completely inaccurate to completely accurate), completeness (rated 1-4 from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales. RESULTS ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01). ChatGPT responses were scored, on average, more complete than Bard responses (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers (29.2%), and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch-Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses were readable with a post-secondary education. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses were readable with a high-school education (p < .0001 for three metrics). ChatGPT's mean response length (332 ± 79 words) was higher than Bard's mean response length (183 ± 53 words, p < .001). There was no difference in the accuracy, completeness, readability, or response length of ChatGPT or Bard between disease domains (p > .05 for all analyses). CONCLUSIONS AI offers a novel means of educating patients that avoids the inundation of information from "Dr Google" and the time barriers of physician-patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
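The three readability scales reported for each response are easy to recompute. The sketch below uses the textstat package's standard implementations on an invented sample answer; the study's own scoring tool is not specified in the abstract.

```python
# Minimal sketch: scoring a chatbot-style answer with the three readability
# metrics named in the abstract. The sample text is an invented placeholder.
import textstat

answer = (
    "Peripheral artery disease happens when narrowed arteries reduce blood flow "
    "to the legs. It can cause pain when walking that eases with rest."
)

print("Flesch Reading Ease:        ", textstat.flesch_reading_ease(answer))
print("Flesch-Kincaid Grade Level: ", textstat.flesch_kincaid_grade(answer))
print("Gunning Fog Index:          ", textstat.gunning_fog(answer))
```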
Collapse
Affiliation(s)
- Ethan Chervonski
- New York University Grossman School of Medicine, New York, NY, USA
| | - Keerthi B Harish
- New York University Grossman School of Medicine, New York, NY, USA
| | - Caron B Rockman
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Mikel Sadek
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Katherine A Teter
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Glenn R Jacobowitz
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Todd L Berland
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Joann Lohr
- Dorn Veterans Affairs Medical Center, Columbia, SC, USA
| | | | - Thomas S Maldonado
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| |
Collapse
|
37
|
Aziz AAA, Abdelrahman HH, Hassan MG. The use of ChatGPT and Google Gemini in responding to orthognathic surgery-related questions: A comparative study. J World Fed Orthod 2025; 14:20-26. [PMID: 39490358 DOI: 10.1016/j.ejwf.2024.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 09/05/2024] [Accepted: 09/05/2024] [Indexed: 11/05/2024]
Abstract
AIM This study employed a quantitative approach to compare the reliability of responses provided by ChatGPT-3.5, ChatGPT-4, and Google Gemini to orthognathic surgery-related questions. MATERIAL AND METHODS The authors adapted a set of 64 questions encompassing all of the domains and aspects related to orthognathic surgery. One author submitted the questions to ChatGPT-3.5, ChatGPT-4, and Google Gemini. The AI-generated responses from the three platforms were recorded and evaluated by 2 blinded and independent experts. The reliability of AI-generated responses was evaluated using a tool assessing the accuracy of information and completeness. In addition, the provision of definitive answers to close-ended questions, references, graphical elements, and advice to schedule consultations with a specialist were collected. RESULTS Although ChatGPT-3.5 achieved the highest information reliability score, the 3 LLMs showed similar reliability scores in providing responses to orthognathic surgery-related inquiries. Moreover, Google Gemini notably included physician recommendations and graphical elements in its responses. Both ChatGPT-3.5 and -4 lacked these features. CONCLUSION This study shows that ChatGPT-3.5, ChatGPT-4, and Google Gemini can provide reliable responses to inquiries about orthognathic surgery. However, Google Gemini stood out by incorporating additional references and illustrations within its responses. These findings highlight the need for additional evaluation of AI capabilities across different healthcare domains.
Collapse
Affiliation(s)
- Ahmed A Abdel Aziz
- Department of Orthodontics, Faculty of Dentistry, Assiut University, Assiut, Egypt
| | - Hams H Abdelrahman
- Department of Pediatric Dentistry and Dental Public Health, Faculty of Dentistry, Alexandria University, Alexandria, Egypt
| | - Mohamed G Hassan
- Department of Orthodontics, Faculty of Dentistry, Assiut University, Assiut, Egypt; Division of Bone and Mineral Diseases, Department of Medicine, School of Medicine, Washington University in St. Louis, St. Louis, Missouri.
| |
Collapse
|
38
|
Zhang Q, Wu Z, Song J, Luo S, Chai Z. Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health. Int Dent J 2025; 75:151-157. [PMID: 39147663 PMCID: PMC11806297 DOI: 10.1016/j.identj.2024.06.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Revised: 06/12/2024] [Accepted: 06/19/2024] [Indexed: 08/17/2024] Open
Abstract
AIM Given the increasing interest in using large language models (LLMs) for self-diagnosis, this study aimed to evaluate the comprehensiveness of two prominent LLMs, ChatGPT-3.5 and ChatGPT-4, in addressing common queries related to gingival and endodontic health across different language contexts and query types. METHODS We assembled a set of 33 common real-life questions related to gingival and endodontic healthcare, including 17 common-sense questions and 16 expert questions. Each question was presented to the LLMs in both English and Chinese. Three specialists were invited to evaluate the comprehensiveness of the responses on a five-point Likert scale, where a higher score indicated greater quality responses. RESULTS LLMs performed significantly better in English, with an average score of 4.53, compared to 3.95 in Chinese (Mann-Whitney U test, P < .05). Responses to common sense questions received higher scores than those to expert questions, with averages of 4.46 and 4.02 (Mann-Whitney U test, P < .05). Among the LLMs, ChatGPT-4 consistently outperformed ChatGPT-3.5, achieving average scores of 4.45 and 4.03 (Mann-Whitney U test, P < .05). CONCLUSIONS ChatGPT-4 provides more comprehensive responses than ChatGPT-3.5 for queries related to gingival and endodontic health. Both LLMs perform better in English and on common sense questions. However, the performance discrepancies across different language contexts and the presence of inaccurate responses suggest that further evaluation and understanding of their limitations are crucial to avoid potential misunderstandings. CLINICAL RELEVANCE This study revealed the performance differences of ChatGPT-3.5 and ChatGPT-4 in handling gingival and endodontic health issues across different language contexts, providing insights into the comprehensiveness and limitations of LLMs in addressing common oral healthcare queries.
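The language and question-type comparisons rest on Mann-Whitney U tests over 5-point Likert scores. A minimal sketch with invented placeholder scores follows; it assumes scipy's two-sided test, which may differ from the study's exact configuration.

```python
# Minimal sketch: Mann-Whitney U test on hypothetical 5-point comprehensiveness
# scores for English vs Chinese responses. Scores are invented placeholders.
from scipy.stats import mannwhitneyu

english_scores = [5, 5, 4, 5, 4, 5, 4, 5, 5, 4]
chinese_scores = [4, 4, 3, 4, 4, 3, 5, 4, 3, 4]

u_stat, p_value = mannwhitneyu(english_scores, chinese_scores, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```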
Collapse
Affiliation(s)
- Qian Zhang
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
| | - Zhengyu Wu
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
| | - Jinlin Song
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China
| | - Shuicai Luo
- Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, China
| | - Zhaowu Chai
- College of Stomatology, Chongqing Medical University, Chongqing, China; Chongqing Key Laboratory for Oral Diseases and Biomedical Sciences, Chongqing, China; Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education, Chongqing, China.
| |
Collapse
|
39
|
Volk SC, Schäfer MS, Lombardi D, Mahl D, Yan X. How generative artificial intelligence portrays science: Interviewing ChatGPT from the perspective of different audience segments. PUBLIC UNDERSTANDING OF SCIENCE (BRISTOL, ENGLAND) 2025; 34:132-153. [PMID: 39344088 PMCID: PMC11783972 DOI: 10.1177/09636625241268910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/01/2024]
Abstract
Generative artificial intelligence in general and ChatGPT in particular have risen in importance. ChatGPT is widely known and used increasingly as an information source for different topics, including science. It is therefore relevant to examine how ChatGPT portrays science and science-related issues. Research on this question is lacking, however. Hence, we simulate "interviews" with ChatGPT and reconstruct how it presents science, science communication, scientific misbehavior, and controversial scientific issues. Combining qualitative and quantitative content analysis, we find that, generally, ChatGPT portrays science largely as the STEM disciplines, in a positivist-empiricist way and a positive light. When comparing ChatGPT's responses to different simulated user profiles and responses from the GPT-3.5 and GPT-4 versions, we find similarities in that the scientific consensus on questions such as climate change, COVID-19 vaccinations, or astrology is consistently conveyed across them. Beyond these similarities in substance, however, pronounced differences are found in the personalization of responses to different user profiles and between GPT-3.5 and GPT-4.
Collapse
|
40
|
Aldukhail S. Mapping the Landscape of Generative Language Models in Dental Education: A Comparison Between ChatGPT and Google Bard. EUROPEAN JOURNAL OF DENTAL EDUCATION : OFFICIAL JOURNAL OF THE ASSOCIATION FOR DENTAL EDUCATION IN EUROPE 2025; 29:136-148. [PMID: 39563479 DOI: 10.1111/eje.13056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/15/2023] [Revised: 08/29/2024] [Accepted: 10/28/2024] [Indexed: 11/21/2024]
Abstract
Generative language models (LLMs) have shown great potential in various fields, including medicine and education. This study evaluated and compared ChatGPT 3.5 and Google Bard within dental education and research. METHODS We developed seven dental education-related queries to assess each model across various domains: their role in dental education, creation of specific exercises, simulations of dental problems with treatment options, development of assessment tools, proficiency in dental literature and their ability to identify, summarise and critique a specific article. Two blind reviewers scored the responses using defined metrics. The means and standard deviations of the scores were reported, and differences between the scores were analysed using Wilcoxon tests. RESULTS ChatGPT 3.5 outperformed Bard in several tasks, including the ability to create highly comprehensive, accurate, clear, relevant and specific exercises on dental concepts, generate simulations of dental problems with treatment options and develop assessment tools. On the other hand, Bard was successful in retrieving real research, and it was able to critique the article it selected. Statistically significant differences were noted between the average scores of the two models (p ≤ 0.05) for domains 1 and 3. CONCLUSION This study highlights the potential of LLMs as dental education tools, enhancing learning through virtual simulations and critical performance analysis. However, the variability in LLMs' performance underscores the need for targeted training, particularly in evidence-based content generation. It is crucial for educators, students and practitioners to exercise caution when considering the delegation of critical educational or healthcare decisions to computer systems.
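Because both models were scored on the same seven queries, the Wilcoxon tests mentioned above are paired comparisons. The sketch below illustrates one such test with invented per-domain scores, assuming scipy's signed-rank implementation.

```python
# Minimal sketch: paired Wilcoxon signed-rank test on hypothetical per-domain
# scores for ChatGPT 3.5 vs Google Bard. Scores are invented placeholders.
from scipy.stats import wilcoxon

chatgpt_scores = [4.5, 4.0, 4.8, 4.2, 3.9, 4.6, 4.1]  # one score per query domain
bard_scores    = [3.8, 4.1, 3.9, 3.7, 4.0, 3.6, 4.2]

stat, p = wilcoxon(chatgpt_scores, bard_scores)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3f}")
```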
Collapse
Affiliation(s)
- Shaikha Aldukhail
- Department of Preventive Dental Sciences, College of Dentistry, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
| |
Collapse
|
41
|
Zhu K, Zhang J, Klishin A, Esser M, Blumentals WA, Juhaeri J, Jouquelet‐Royer C, Sinnott S. Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology. Pharmacoepidemiol Drug Saf 2025; 34:e70111. [PMID: 39901360 PMCID: PMC11791122 DOI: 10.1002/pds.70111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2024] [Revised: 01/15/2025] [Accepted: 01/16/2025] [Indexed: 02/05/2025]
Abstract
PURPOSE Accurate background epidemiology of diseases is required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency. METHODS A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting "gold-standard" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated. RESULTS The three LLMs generated 126 responses. For ChatGPT-4, 76.2% of responses were accurate, higher than the 50.0% for Bard and 45.2% for ChatGPT-3.5. ChatGPT-4 exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references, of which 27 (51.9%) contained relevant information, and all were authentic. Only 9.2% (10/109) of references from Bard were relevant. Of the 65 unique references among the 109, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% provided an inaccurate citation, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references. CONCLUSIONS ChatGPT-4 outperformed Bard and ChatGPT-3.5 in retrieving information on disease epidemiology. However, all three LLMs presented inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude the utility of the current forms of LLMs in obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in the regulatory setting.
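Accuracy against a benchmark and consistency across repeated submissions reduce to simple tallies. The sketch below shows one way to compute both from per-question records; the records themselves are invented examples, not the study's data.

```python
# Minimal sketch: tallying accuracy (agreement with benchmark data) and
# consistency (same answer on both submission dates). Records are invented.
records = [
    {"question": "prevalence of type 2 diabetes, US adults", "day1_ok": True,  "day2_ok": True,  "consistent": True},
    {"question": "incidence of ALS, worldwide",               "day1_ok": False, "day2_ok": True,  "consistent": False},
    {"question": "prevalence of psoriasis, Europe",           "day1_ok": True,  "day2_ok": True,  "consistent": True},
]

n_responses = 2 * len(records)  # two submissions per question
accuracy = sum(r["day1_ok"] + r["day2_ok"] for r in records) / n_responses
consistency = sum(r["consistent"] for r in records) / len(records)

print(f"accuracy:    {accuracy:.1%}")
print(f"consistency: {consistency:.1%}")
```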
Collapse
Affiliation(s)
- Kexin Zhu
- Epidemiology and Benefit Risk, Sanofi, Bridgewater, New Jersey, USA
| | | | | | - Mario Esser
- Global Pharmacovigilance, Sanofi, Frankfurt, Germany
| | | | - Juhaeri Juhaeri
- Epidemiology and Benefit Risk, Sanofi, Bridgewater, New Jersey, USA
| | | | | |
Collapse
|
42
|
Kon MHA, Pereira MJ, Molina JADC, Yip VCH, Abisheganaden JA, Yip W. Unravelling ChatGPT's potential in summarising qualitative in-depth interviews. Eye (Lond) 2025; 39:354-358. [PMID: 39501005 PMCID: PMC11751335 DOI: 10.1038/s41433-024-03419-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Revised: 10/13/2024] [Accepted: 10/17/2024] [Indexed: 01/23/2025] Open
Abstract
BACKGROUND/OBJECTIVES Qualitative research can be laborious and time-consuming, presenting a challenge for practitioners and policymakers seeking rapid, actionable results. Data collection, transcription, and analysis are the main contributors to its resource-intensive nature. OpenAI's Chat Generative Pre-trained Transformer (ChatGPT) has demonstrated potential to aid in data analysis. Our study aimed to compare themes generated by ChatGPT (3.5 and 4.0) with traditional human analysis from in-depth interviews. METHODS Three transcripts from an evaluation study to understand patients' experiences at a community eye clinic were used. Transcripts were first analysed by an independent researcher. Next, specific aims, instructions and de-identified transcripts were uploaded to ChatGPT 3.5 and ChatGPT 4.0. Concordance in the themes was calculated as the number of themes generated by ChatGPT divided by the number of themes generated by the researcher. The number of unrelated subthemes and time taken by both ChatGPT versions were also described. RESULTS The average time taken per transcript was 11.5 min, 11.9 min and 240 min for ChatGPT 3.5, ChatGPT 4.0 and the researcher respectively. Six themes were identified by the researcher: (i) clinic's accessibility, (ii) patients' awareness, (iii) trust and satisfaction, (iv) patients' expectations, (v) willingness to return and (vi) explanation of the clinic by referral source. Concordance for ChatGPT 3.5 and 4.0 ranged from 66 to 100%. CONCLUSION Preliminary results showed that ChatGPT significantly reduced analysis time with moderate to good concordance compared with current practice. This highlights the potential of adopting ChatGPT to facilitate rapid preliminary analysis. However, regrouping of subthemes will still need to be conducted by a researcher.
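The concordance measure defined in the methods is a simple ratio. The sketch below computes it for an invented set of themes, reading the numerator as the researcher-identified themes that ChatGPT also produced, which is one interpretation of the abstract's wording.

```python
# Minimal sketch: concordance between ChatGPT-generated and researcher-generated
# themes, computed as matched themes / researcher themes. Theme lists are invented.
researcher_themes = {
    "clinic accessibility", "patient awareness", "trust and satisfaction",
    "patient expectations", "willingness to return", "referral explanation",
}
chatgpt_themes = {
    "clinic accessibility", "patient awareness",
    "trust and satisfaction", "willingness to return",
}

matched = researcher_themes & chatgpt_themes
concordance = len(matched) / len(researcher_themes)
print(f"concordance = {concordance:.0%}")  # 4 of 6 themes -> ~67%
```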
Collapse
Affiliation(s)
- Mei Hui Adeline Kon
- National University of Ireland, Galway, Ireland
- Health Services and Outcomes Research, National Healthcare Group, Singapore, Singapore
| | | | | | - Vivien Cherng Hui Yip
- National Healthcare Group Eye Institute, Tan Tock Seng Hospital, Singapore, Singapore
| | | | - WanFen Yip
- Health Services and Outcomes Research, National Healthcare Group, Singapore, Singapore.
| |
Collapse
|
43
|
Wang J, Shi R, Le Q, Shan K, Chen Z, Zhou X, He Y, Hong J. Evaluating the effectiveness of large language models in patient education for conjunctivitis. Br J Ophthalmol 2025; 109:185-191. [PMID: 39214677 DOI: 10.1136/bjo-2024-325599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Accepted: 08/03/2024] [Indexed: 09/04/2024]
Abstract
AIMS To evaluate the quality of responses from large language models (LLMs) to patient-generated conjunctivitis questions. METHODS A two-phase, cross-sectional study was conducted at the Eye and ENT Hospital of Fudan University. In phase 1, four LLMs (GPT-4, Qwen, Baichuan 2 and PaLM 2) responded to 22 frequently asked conjunctivitis questions. Six expert ophthalmologists assessed these responses using a 5-point Likert scale for correctness, completeness, readability, helpfulness and safety, supplemented by objective readability analysis. Phase 2 involved 30 conjunctivitis patients who interacted with GPT-4 or Qwen, evaluating the LLM-generated responses based on satisfaction, humanisation, professionalism and the same dimensions except for correctness from phase 1. Three ophthalmologists assessed responses using phase 1 criteria, allowing for a comparative analysis between medical and patient evaluations, probing the study's practical significance. RESULTS In phase 1, GPT-4 excelled across all metrics, particularly in correctness (4.39±0.76), completeness (4.31±0.96) and readability (4.65±0.59) while Qwen showed similarly strong performance in helpfulness (4.37±0.93) and safety (4.25±1.03). Baichuan 2 and PaLM 2 were effective but trailed behind GPT-4 and Qwen. The objective readability analysis revealed GPT-4's responses as the most detailed, with PaLM 2's being the most succinct. Phase 2 demonstrated GPT-4 and Qwen's robust performance, with high satisfaction levels and consistent evaluations from both patients and professionals. CONCLUSIONS Our study showed LLMs effectively improve patient education in conjunctivitis. These models showed considerable promise in real-world patient interactions. Despite encouraging results, further refinement, particularly in personalisation and handling complex inquiries, is essential prior to the clinical integration of these LLMs.
Collapse
Affiliation(s)
- Jingyuan Wang
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
| | - Runhan Shi
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
| | - Qihua Le
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
| | - Kun Shan
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
| | - Zhi Chen
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
| | - Xujiao Zhou
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
| | - Yao He
- Macao Translational Medicine Center, Macau University of Science and Technology, Taipa, Macau SAR, Macau, People's Republic of China
| | - Jiaxu Hong
- Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, People's Republic of China
- NHC Key laboratory of Myopia and Related Eye Diseases, Shanghai, People's Republic of China
- Shanghai Engineering Research Center of Synthetic Immunology, Shanghai, People's Republic of China
- Department of Ophthalmology, Children's Hospital of Fudan University, National Pediatric Medical Center of China, Shanghai, People's Republic of China
| |
Collapse
|
44
|
Ma R, Cheng Q, Yao J, Peng Z, Yan M, Lu J, Liao J, Tian L, Shu W, Zhang Y, Wang J, Jiang P, Xia W, Li X, Gan L, Zhao Y, Zhu J, Qin B, Jiang Q, Wang X, Lin X, Chen H, Zhu W, Xiang D, Nie B, Wang J, Guo J, Xue K, Cui H, Cheng J, Zhu X, Hong J, Shi F, Zhang R, Chen X, Zhao C. Multimodal machine learning enables AI chatbot to diagnose ophthalmic diseases and provide high-quality medical responses. NPJ Digit Med 2025; 8:64. [PMID: 39870855 PMCID: PMC11772878 DOI: 10.1038/s41746-025-01461-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2024] [Accepted: 01/15/2025] [Indexed: 01/29/2025] Open
Abstract
Chatbot-based multimodal AI holds promise for collecting medical histories and diagnosing ophthalmic diseases using textual and imaging data. This study developed and evaluated the ChatGPT-powered Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) to enable patient self-diagnosis and self-triage. IOMIDS included a text model and three multimodal models (text + slit-lamp, text + smartphone, text + slit-lamp + smartphone). The performance was evaluated through a two-stage cross-sectional study across three medical centers involving 10 subspecialties and 50 diseases. Using 15640 data entries, IOMIDS actively collected and analyzed medical history alongside slit-lamp and/or smartphone images. The text + smartphone model showed the highest diagnostic accuracy (internal: 79.6%, external: 81.1%), while other multimodal models underperformed or matched the text model (internal: 69.6%, external: 72.5%). Moreover, triage accuracy was consistent across models. Multimodal approaches enhanced response quality and reduced misinformation. This proof-of-concept study highlights the potential of chatbot-based multimodal AI for self-diagnosis and self-triage. (The clinical trial was registered on June 26, 2023, on ClinicalTrials.gov under the registration number NCT05930444.).
Collapse
Affiliation(s)
- Ruiqi Ma
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Qian Cheng
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China
| | - Jing Yao
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Zhiyu Peng
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
- Department of Ophthalmology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Mingxu Yan
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
- School of Basic Medical Sciences, Fudan University, Shanghai, China
| | - Jie Lu
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
- School of Basic Medical Sciences, Fudan University, Shanghai, China
| | - Jingjing Liao
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China
| | - Lejin Tian
- State Key Laboratory of Genetic Engineering, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai, China
| | - Wenjun Shu
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
- School of Basic Medical Sciences, Fudan University, Shanghai, China
| | - Yunqiu Zhang
- Department of Epidemiology, School of Public Health, and The Key Laboratory of Public Health Safety of Ministry of Education, Fudan University, Shanghai, China
| | - Jinghan Wang
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Pengfei Jiang
- Shanghai Jiao Tong University Instrument Analysis Center, Shanghai, China
| | - Weiyi Xia
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Xiaofeng Li
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Lu Gan
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Yue Zhao
- The Affiliated Eye Hospital, Nanjing Medical University, Nanjing, China
| | - Jiang Zhu
- Department of Ophthalmology, Suqian First Hospital, Suqian, China
| | - Bing Qin
- Department of Ophthalmology, Suqian First Hospital, Suqian, China
| | - Qin Jiang
- The Affiliated Eye Hospital, Nanjing Medical University, Nanjing, China
- The Fourth School of Clinical Medicine, Nanjing Medical University, Nanjing, China
| | - Xiawei Wang
- Department of Ophthalmology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Xintong Lin
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Haifeng Chen
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Weifang Zhu
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China
| | - Dehui Xiang
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China
| | - Baoqing Nie
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China
| | - Jingtao Wang
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China
| | - Jie Guo
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Kang Xue
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Hongguang Cui
- Department of Ophthalmology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Jinwei Cheng
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Xiangjia Zhu
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Jiaxu Hong
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China
| | - Fei Shi
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China.
| | - Rui Zhang
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China.
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China.
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China.
| | - Xinjian Chen
- MIPAV Lab, School of Electronics and Information Engineering, Soochow University, Suzhou, China.
- State Key Laboratory of Radiation Medicine and Protection, Soochow University, Suzhou, China.
| | - Chen Zhao
- Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai, China.
- NHC Key Laboratory of Myopia and Related Eye Diseases; Key Laboratory of Myopia and Related Eye Diseases, Chinese Academy of Medical Sciences, Shanghai, China.
- Shanghai Key Laboratory of Visual Impairment and Restoration, Shanghai, China.
| |
Collapse
|
45
|
Li Y, Huang CK, Hu Y, Zhou XD, He C, Zhong JW. Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study. World J Gastroenterol 2025; 31:101092. [PMID: 39839898 PMCID: PMC11684168 DOI: 10.3748/wjg.v31.i3.101092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/04/2024] [Revised: 10/29/2024] [Accepted: 12/03/2024] [Indexed: 12/20/2024] Open
Abstract
BACKGROUND Patients with hepatitis B virus (HBV) infection require chronic, personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients. AIM To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions. METHODS The LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was submitted three times to each of the three LLMs. Readability was assessed with the Gunning Fog index and the Flesch-Kincaid grade level. RESULTS Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). For objective questions, ChatGPT-4.0 achieved an accuracy rate of 80.8%, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better in diagnosis, whereas Google Gemini excelled in clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of all three chatbots were significantly above the recommended eighth-grade reading level, far exceeding the reading level of the general population. CONCLUSION Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may serve as an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
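For context, the two readability indexes cited above follow fixed formulas: Gunning Fog = 0.4 × [words/sentence + 100 × complex words/words] (complex words having three or more syllables), and Flesch-Kincaid grade = 0.39 × words/sentence + 11.8 × syllables/word − 15.59. The minimal Python sketch below illustrates how such scores can be computed; it is not the authors' tooling, and the vowel-group syllable counter is only a rough stand-in for the dictionary-based counters used by standard readability software.

```python
import re

def _syllables(word: str) -> int:
    # Crude vowel-group heuristic; standard readability tools use
    # dictionary- or rule-based syllable counters instead.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    n_syllables = sum(_syllables(w) for w in words)
    n_complex = sum(1 for w in words if _syllables(w) >= 3)  # "complex" = 3+ syllables

    gunning_fog = 0.4 * (n_words / n_sent + 100.0 * n_complex / n_words)
    fk_grade = 0.39 * (n_words / n_sent) + 11.8 * (n_syllables / n_words) - 15.59
    return {"gunning_fog": round(gunning_fog, 1), "flesch_kincaid_grade": round(fk_grade, 1)}

sample = ("Chronic hepatitis B virus infection requires long-term, individualized care. "
          "Antiviral therapy suppresses viral replication and lowers the risk of cirrhosis.")
print(readability(sample))
```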
Collapse
Affiliation(s)
- Yu Li
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
- HuanKui Academy, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Chen-Kai Huang
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Yi Hu
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Xiao-Dong Zhou
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Cong He
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| | - Jia-Wei Zhong
- Department of Gastroenterology, Jiangxi Provincial Key Laboratory of Digestive Diseases, Jiangxi Clinical Research Center for Gastroenterology, Digestive Disease Hospital, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang 330006, Jiangxi Province, China
| |
Collapse
|
46
|
García-Rudolph A, Sanchez-Pinsach D, Opisso E. Evaluating AI Models: Performance Validation Using Formal Multiple-Choice Questions in Neuropsychology. Arch Clin Neuropsychol 2025; 40:150-155. [PMID: 39231527 DOI: 10.1093/arclin/acae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2024] [Revised: 08/13/2024] [Accepted: 08/19/2024] [Indexed: 09/06/2024] Open
Abstract
High-quality and accessible education is crucial for advancing neuropsychology. A recent study identified key barriers to board certification in clinical neuropsychology, such as time constraints and insufficient specialized knowledge. To address these challenges, this study explored the capabilities of advanced artificial intelligence (AI) language models, GPT-3.5 (free version) and GPT-4.0 (subscription version), by evaluating their performance on 300 American Board of Professional Psychology in Clinical Neuropsychology-like questions. The results indicate that GPT-4.0 achieved a higher accuracy rate of 80.0% compared with GPT-3.5's 65.7%. In the "Assessment" category, GPT-4.0 demonstrated a notable improvement, with an accuracy rate of 73.4% versus GPT-3.5's 58.6% (p = 0.012). This category, which comprised 128 questions, exhibited the highest error rate for both AI models and was therefore analyzed further. A thematic analysis of the 26 questions answered incorrectly revealed 8 main themes and 17 specific codes, highlighting significant gaps in areas such as "Neurodegenerative Diseases" and "Neuropsychological Testing and Interpretation."
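The abstract does not name the statistical test behind the reported p-value; as one plausible reading, a two-proportion z-test on counts inferred from the reported percentages (roughly 94/128 correct for GPT-4.0 vs. 75/128 for GPT-3.5 in the "Assessment" category, an assumption) yields a p-value close to the reported 0.012. A self-contained Python sketch:

```python
import math

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via the error function).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Illustrative counts inferred from the reported percentages on the 128
# "Assessment" questions: 73.4% ~ 94/128 (GPT-4.0) vs. 58.6% ~ 75/128 (GPT-3.5).
z, p = two_proportion_z_test(94, 128, 75, 128)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")  # ~0.012, consistent with the abstract
```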
Collapse
Affiliation(s)
- Alejandro García-Rudolph
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
| | - David Sanchez-Pinsach
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
| | - Eloy Opisso
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
| |
Collapse
|
47
|
García-Rudolph A, Sanchez-Pinsach D, Wright MA, Opisso E, Vidal J. Assessing readability of explanations and reliability of answers by GPT-3.5 and GPT-4 in non-traumatic spinal cord injury education. MEDICAL TEACHER 2025:1-8. [PMID: 39832525 DOI: 10.1080/0142159x.2024.2430365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/05/2024] [Accepted: 11/12/2024] [Indexed: 01/22/2025]
Abstract
PURPOSE Our study aimed to: i) assess the readability of textbook explanations using established indexes; ii) compare these with GPT-4's default explanations, ensuring similar word counts for direct comparison; iii) evaluate GPT-4's adaptability by simplifying high-complexity explanations; and iv) determine the reliability of GPT-3.5 and GPT-4 in providing accurate answers. MATERIALS AND METHODS We used a textbook designed for ABPMR certification. Our analysis covered 50 multiple-choice questions, each with a detailed explanation, focusing on non-traumatic spinal cord injury (NTSCI). RESULTS Our analysis revealed statistically significant differences in readability scores, with the textbook scoring 14.5 (SD = 2.5) compared with GPT-4's 17.3 (SD = 1.9), indicating that GPT-4's explanations are generally more complex (p < 0.001). By the Flesch Reading Ease Score, 86% of GPT-4's explanations fell into the 'Very difficult' category, significantly more than the textbook's 58% (p = 0.006). GPT-4 demonstrated adaptability by reducing the mean readability score of the nine most complex explanations while maintaining the word count. Regarding reliability, GPT-3.5 and GPT-4 scored 84% and 96%, respectively, with GPT-4 outperforming GPT-3.5 (p = 0.046). CONCLUSIONS Our results confirm GPT-4's potential in medical education: it provides highly accurate yet often complex explanations for NTSCI, which were successfully simplified without losing accuracy.
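For reference, the Flesch Reading Ease Score used above is 206.835 − 1.015 × (words/sentence) − 84.6 × (syllables/word), and scores below 30 are conventionally labelled 'Very difficult'. A minimal sketch of that computation follows; it uses a crude vowel-group syllable heuristic and is not the authors' tooling.

```python
import re

def _syllables(word: str) -> int:
    # Rough vowel-group count; real readability software uses more careful rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    n_sent = max(1, len([s for s in re.split(r"[.!?]+", text) if s.strip()]))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syll = sum(_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)

def fres_band(score: float) -> str:
    # Conventional interpretation bands; below 30 corresponds to "Very difficult".
    if score < 30:
        return "Very difficult"
    if score < 50:
        return "Difficult"
    if score < 60:
        return "Fairly difficult"
    return "Standard or easier"

text = ("Non-traumatic spinal cord injury may arise from neoplastic, infectious, "
        "vascular, inflammatory, or degenerative aetiologies.")
score = flesch_reading_ease(text)
print(round(score, 1), fres_band(score))
```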
Collapse
Affiliation(s)
- Alejandro García-Rudolph
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Bellaterra, Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Barcelona, Spain
| | - David Sanchez-Pinsach
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Bellaterra, Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Barcelona, Spain
| | - Mark Andrew Wright
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Bellaterra, Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Barcelona, Spain
| | - Eloy Opisso
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Bellaterra, Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Barcelona, Spain
| | - Joan Vidal
- Departamento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Bellaterra, Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Barcelona, Spain
| |
Collapse
|
48
|
Yang X, Li T, Su Q, Liu Y, Kang C, Lyu Y, Zhao L, Nie Y, Pan Y. Application of large language models in disease diagnosis and treatment. Chin Med J (Engl) 2025; 138:130-142. [PMID: 39722188 PMCID: PMC11745858 DOI: 10.1097/cm9.0000000000003456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2024] [Indexed: 12/28/2024] Open
Abstract
ABSTRACT Large language models (LLMs) such as ChatGPT, Claude, Llama, and Qwen are emerging as transformative technologies for the diagnosis and treatment of various diseases. With their exceptional long-context reasoning capabilities, LLMs are proficient in clinically relevant tasks, particularly in medical text analysis and interactive dialogue. They can enhance diagnostic accuracy by processing vast amounts of patient data and medical literature and have demonstrated their utility in diagnosing common diseases and facilitating the identification of rare diseases by recognizing subtle patterns in symptoms and test results. Building on their image-recognition abilities, multimodal LLMs (MLLMs) show promising potential for diagnosis based on radiography, chest computed tomography (CT), electrocardiography (ECG), and common pathological images. These models can also assist in treatment planning by suggesting evidence-based interventions and improving clinical decision support systems through integrated analysis of patient records. Despite these promising developments, significant challenges persist regarding the use of LLMs in medicine, including concerns about algorithmic bias, the potential for hallucinations, and the need for rigorous clinical validation. Ethical considerations also underscore the importance of maintaining human supervision in clinical practice. This paper highlights the rapid advancements in research on the diagnostic and therapeutic applications of LLMs across different medical disciplines and emphasizes the importance of policymaking, ethical supervision, and multidisciplinary collaboration in promoting more effective and safer clinical applications of LLMs. Future directions include the integration of proprietary clinical knowledge, the investigation of open-source and customized models, and the evaluation of real-time effects in clinical diagnosis and treatment practices.
Collapse
Affiliation(s)
- Xintian Yang
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Tongxin Li
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Qin Su
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Yaling Liu
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Chenxi Kang
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Yong Lyu
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Lina Zhao
- Department of Radiotherapy, Xijing Hospital, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Yongzhan Nie
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| | - Yanglin Pan
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
| |
Collapse
|
49
|
Su Z, Jin K, Wu H, Luo Z, Grzybowski A, Ye J. Assessment of Large Language Models in Cataract Care Information Provision: A Quantitative Comparison. Ophthalmol Ther 2025; 14:103-116. [PMID: 39516445 PMCID: PMC11724831 DOI: 10.1007/s40123-024-01066-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Accepted: 10/24/2024] [Indexed: 11/16/2024] Open
Abstract
INTRODUCTION Cataracts are a significant cause of blindness. While individuals frequently turn to the Internet for medical advice, distinguishing reliable information can be challenging. Large language models (LLMs) have attracted attention for generating accurate, human-like responses that may be used for medical consultation. However, a comprehensive assessment of LLMs' accuracy within specific medical domains is still lacking. METHODS We compiled 46 commonly asked questions related to cataract care, categorized into six domains. Each question was presented to the LLMs, and three consultant-level ophthalmologists independently assessed the accuracy of the responses on a three-point scale (poor, borderline, good) and their comprehensiveness on a five-point scale. A majority-consensus approach established the final rating for each response. Responses rated 'Poor' were prompted for self-correction and reassessed. RESULTS For accuracy, ChatGPT-4o and Google Bard both achieved average sum scores of 8.7 (out of 9), followed by ChatGPT-3.5, Bing Chat, Llama 2, and Wenxin Yiyan. In the consensus ratings, ChatGPT-4o outperformed Google Bard on 'Good' ratings. For completeness, ChatGPT-4o had the highest average sum score of 13.22 (out of 15), followed by Google Bard, ChatGPT-3.5, Llama 2, Bing Chat, and Wenxin Yiyan. Detailed performance data reveal nuanced differences in model capabilities. In the 'Prevention' domain, all models except Wenxin Yiyan were rated 'Good'. All models improved on self-correction: Google Bard and Bing Chat each corrected their single 'Poor' response, Llama 2 corrected 3 of 4, and Wenxin Yiyan corrected 4 of 5. CONCLUSIONS Our findings emphasize the potential of LLMs, particularly ChatGPT-4o, to deliver accurate and comprehensive responses to cataract-related queries, especially regarding prevention, indicating potential for medical consultation. Continuous efforts to enhance LLMs' accuracy through ongoing strategies and evaluations remain essential.
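The majority-consensus step described above is a simple aggregation over the three graders' labels. The sketch below shows one plausible implementation; the paper does not state how three-way disagreements were resolved, so the middle-rating fallback is an assumption.

```python
from collections import Counter

# Three-point accuracy scale used by the graders in the abstract above.
SCALE = {"poor": 1, "borderline": 2, "good": 3}

def consensus(ratings: list[str]) -> str:
    """Majority rating across three graders; middle rating if all three disagree."""
    label, count = Counter(ratings).most_common(1)[0]
    if count >= 2:
        return label
    # No majority: fall back to the median (middle) rating -- an assumption,
    # since the paper does not describe its tie-breaking rule.
    return sorted(ratings, key=SCALE.get)[1]

print(consensus(["good", "good", "borderline"]))  # -> good
print(consensus(["poor", "borderline", "good"]))  # -> borderline
```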
Collapse
Affiliation(s)
- Zichang Su
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
| | - Kai Jin
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China.
| | - Hongkang Wu
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
| | - Ziyao Luo
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China
- Zhejiang University Chu Kochen Honors College, Hangzhou, 310009, China
| | - Andrzej Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznań, Poland
| | - Juan Ye
- Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China.
| |
Collapse
|
50
|
Zeljkovic I, Novak M, Jordan A, Lisicic A, Nemeth-Blažić T, Pavlovic N, Manola Š. Evaluating ChatGPT-4's correctness in patient-focused informing and awareness for atrial fibrillation. Heart Rhythm O2 2025; 6:58-63. [PMID: 40224268 PMCID: PMC11993680 DOI: 10.1016/j.hroo.2024.10.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/15/2025] Open
Abstract
Background As artificial intelligence and large language models continue to evolve, their application in health care is expanding. OpenAI's Chat Generative Pre-trained Transformer 4 (ChatGPT-4) represents the latest advancement in this technology, capable of engaging in complex dialogue and providing information. Objective This study explores the correctness of ChatGPT-4 in informing patients about atrial fibrillation. Methods This cross-sectional observational study presented ChatGPT-4 with a structured set of 108 questions across 10 categories related to atrial fibrillation. These categories included basic information, treatment options, lifestyle adjustments, and more, reflecting common patient inquiries. The model's responses were evaluated by a panel of 3 cardiologists on the basis of accuracy, comprehensiveness, clarity, relevance to clinical practice, and patient safety. The total correctness of ChatGPT-4 was quantitatively assessed through scores assigned in each category, and statistical analysis was performed to identify significant differences in performance across categories. Results ChatGPT-4 provided correct and relevant answers overall, with considerable variability across categories. It excelled in "Lifestyle Adjustments" and "Daily Life and Management", with perfect and near-perfect scores, but scored lower in "Miscellaneous Concerns". Statistical analysis confirmed significant differences in total scores across categories (P = .020). Conclusion Our results suggest that while ChatGPT-4 is reliable for structured and direct queries, it shows limitations when handling complex medical queries that require in-depth explanation or clinical judgment. ChatGPT-4 demonstrates promising potential as a tool for patient-focused informing in atrial fibrillation, particularly for straightforward informational content.
Collapse
Affiliation(s)
- Ivan Zeljkovic
- Department of Cardiovascular Diseases, Dubrava University Hospital, Avenija Gojka Šuška, Zagreb, Croatia
- Catholic University of Croatia, Zagreb, Croatia
| | - Matea Novak
- Catholic University of Croatia, Zagreb, Croatia
- RIT Croatia, Rochester Institute of Technology, Zagreb, Croatia
| | - Ana Jordan
- Department of Cardiovascular Diseases, Dubrava University Hospital, Avenija Gojka Šuška, Zagreb, Croatia
| | - Ante Lisicic
- Department of Cardiovascular Diseases, Dubrava University Hospital, Avenija Gojka Šuška, Zagreb, Croatia
| | | | - Nikola Pavlovic
- Department of Cardiovascular Diseases, Dubrava University Hospital, Avenija Gojka Šuška, Zagreb, Croatia
| | - Šime Manola
- Department of Cardiovascular Diseases, Dubrava University Hospital, Avenija Gojka Šuška, Zagreb, Croatia
| |
Collapse
|