1
Guo S, Li G, Du W, Situ F, Li Z, Lei J. The performance of ChatGPT and ERNIE Bot in surgical resident examinations. Int J Med Inform 2025; 200:105906. [PMID: 40220627] [DOI: 10.1016/j.ijmedinf.2025.105906]
Abstract
STUDY PURPOSE To assess the application of two large language models (LLMs), ChatGPT-4.0 and ERNIE Bot-4.0, to surgical resident examinations and to compare their performance with that of human residents. STUDY DESIGN In this study, 596 questions with a total of 183,556 responses were first included from the Medical Vision World, an authoritative medical education platform in China. Chinese questions, both with and without prompts, were input into ChatGPT-4.0 and ERNIE Bot-4.0 to compare their performance on a Chinese question database. Additionally, we screened another 210 surgical questions with detailed response results from 43 residents to compare the performance of the residents and the two LLMs. RESULTS There were no significant differences in the correctness of the responses to the 596 questions with or without prompts for either LLM (ChatGPT-4.0: 68.96% [without prompts], 71.14% [with prompts], p = 0.411; ERNIE Bot-4.0: 78.36% [without prompts], 78.86% [with prompts], p = 0.832), but ERNIE Bot-4.0 displayed higher correctness than ChatGPT-4.0 (with prompts: p = 0.002; without prompts: p < 0.001). For another 210 questions with prompts, the two LLMs, especially ERNIE Bot-4.0 (which ranked in the top 95% of the 43 residents' scores), significantly outperformed the residents. CONCLUSIONS The performance of ERNIE Bot-4.0 was superior to that of ChatGPT-4.0 and that of residents on surgical resident examinations in a Chinese question database.
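For readers who want to see what this kind of correctness comparison looks like in practice, the sketch below runs a chi-squared test on two proportions. It is a hypothetical illustration only: the scipy call is standard, but the counts are reconstructed from the percentages quoted in the abstract and are not the study's raw data or analysis code.

```python
# Hypothetical sketch: chi-squared comparison of two LLMs' correctness on
# the same 596-question set (counts reconstructed from quoted percentages,
# not taken from the study's raw data).
from scipy.stats import chi2_contingency

n_questions = 596
correct_chatgpt = round(0.6896 * n_questions)   # ChatGPT-4.0, without prompts
correct_ernie = round(0.7836 * n_questions)     # ERNIE Bot-4.0, without prompts

table = [
    [correct_chatgpt, n_questions - correct_chatgpt],
    [correct_ernie, n_questions - correct_ernie],
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```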
Affiliation(s)
- Siyin Guo
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; The Laboratory of Thyroid and Parathyroid Disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
- Genpeng Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; The Laboratory of Thyroid and Parathyroid Disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
- Wei Du
- Beijing Medical Vision Times Technology Development Company Limited, Beijing, China.
- Fangzhi Situ
- Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China.
- Zhihui Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; The Laboratory of Thyroid and Parathyroid Disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
- Jianyong Lei
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; The Laboratory of Thyroid and Parathyroid Disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
2
Gökcek Taraç M, Nale T. Artificial intelligence in pediatric dental trauma: do artificial intelligence chatbots address parental concerns effectively? BMC Oral Health 2025; 25:736. [PMID: 40382588] [PMCID: PMC12085849] [DOI: 10.1186/s12903-025-06105-z]
Abstract
BACKGROUND This study focused on two artificial intelligence chatbots, ChatGPT 3.5 and Google Gemini, as the primary tools for answering questions related to traumatic dental injuries. The aim of this study was to evaluate the reliability, understandability, and applicability of the responses provided by these chatbots to questions commonly asked by parents of children with dental trauma. METHODS The case scenarios were developed from questions that parents commonly ask their dentists or artificial intelligence chatbots regarding dental trauma in children. The quality and accuracy of the information obtained from the chatbots were assessed using the DISCERN instrument. The understandability and actionability of the responses were assessed using the Patient Education Materials Assessment Tool for Printed Materials. In the statistical analysis, categorical variables were analyzed in terms of frequency and percentage; for numerical variables, skewness and kurtosis values were calculated to assess normal distribution. RESULTS Both chatbots performed similarly, although Google Gemini provided higher-quality and more reliable responses. Based on the mean scores, ChatGPT 3.5 had higher understandability. Both chatbots demonstrated similar levels of actionability. CONCLUSION Artificial intelligence applications can serve as a helpful starting point for parents seeking information and reassurance after dental trauma. However, they should not replace professional dental consultations, as their reliability is not absolute. Parents should use artificial intelligence applications as complementary resources and seek timely professional advice for accurate diagnosis and treatment.
Affiliation(s)
- Mihriban Gökcek Taraç
- Department of Pediatric Dentistry, Karabuk University School of Dentistry, Karabük, Turkey.
- Tuğba Nale
- Antalya Oral and Dental Health Hospital, Antalya, Turkey
3
Khareedi R, Fernandez D. The Role of Chatbots in Enquiry-Based Learning for Oral Health Students-An Exploratory Study. Eur J Dent Educ 2025. [PMID: 40372926] [DOI: 10.1111/eje.13115]
Abstract
OBJECTIVE This study explores the reliability of four chatbots in enquiry-based learning. Four chatbots, namely Microsoft Copilot, Google Gemini, ChatGPT 3.5, and Perplexity, were used to answer and generate questions in four specific subject areas. METHODS The four chatbots were asked to answer questions at three cognitive levels and to generate questions based on specific contexts. The responses generated were assessed by two oral health academics for accuracy and appropriateness. RESULTS The findings indicated that ChatGPT 3.5 generated the best self-assessment questions, while Microsoft Copilot generated the best answers to questions. The performance of the chatbots varied with the subject of the question and its cognitive level. While questions at cognitive level one were answered most appropriately, the overall depth of responses to periodontology questions at cognitive level two was lower than for questions on dental materials, restorative dentistry, and oral biology. CONCLUSION The potential role of chatbots in enquiry-based learning is evident to some extent, but they do not yet have the proficiency of a human teacher.
Affiliation(s)
- Rohini Khareedi
- Department of Oral Health, School of Clinical Sciences, Auckland University of Technology, Auckland, New Zealand
- Daniel Fernandez
- Department of Oral Health, School of Clinical Sciences, Auckland University of Technology, Auckland, New Zealand
4
Gunes YC, Cesur T. The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study. J Thorac Imaging 2025; 40:e0805. [PMID: 39269227] [DOI: 10.1097/rti.0000000000000805]
Abstract
PURPOSE To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by The Society of Thoracic Radiology. MATERIALS AND METHODS We collected 124 publicly available "Case of the Month" cases from the Society of Thoracic Radiology website between March 2012 and December 2023. Medical history and imaging findings were input into the LLMs for diagnosis and differential diagnosis, while the radiologists independently provided their assessments based on visual review. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests. RESULTS Among the 124 cases, Claude 3 Opus showed the highest diagnostic accuracy (70.29%), followed by ChatGPT 4/Google Gemini 1.5 Pro (59.75%), Meta Llama 3 70b (57.3%), and ChatGPT 3.5 (53.2%), outperforming the radiologists (52.4% and 41.1%) and the other LLMs (P < 0.05). The Claude 3 Opus DDxScore was significantly better than those of the other LLMs and the radiologists, except ChatGPT 3.5 (P < 0.05). All LLMs and radiologists showed greater accuracy in specific cases (P < 0.05), with no DDxScore difference for Perplexity and Google Bard based on specificity (P > 0.05). There were no significant differences between LLMs and radiologists in the diagnostic accuracy of anatomic subgroups (P > 0.05), except for Meta Llama 3 70b in the vascular cases (P = 0.040). CONCLUSIONS Claude 3 Opus outperformed the other LLMs and radiologists in text-based thoracic radiology cases. LLMs hold great promise for clinical decision systems under proper medical supervision.
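The abstract lists McNemar's test among its methods; the snippet below is a minimal, hypothetical illustration of how such a paired accuracy comparison (an LLM and a radiologist reading the same cases) can be set up. The 2x2 counts are invented for demonstration and do not come from the study.

```python
# Illustrative sketch, not the authors' code: McNemar's test for paired
# diagnostic accuracy (same cases read by an LLM and a radiologist).
# The concordant/discordant counts below are invented.
from statsmodels.stats.contingency_tables import mcnemar

#              radiologist correct | radiologist wrong
table = [[55, 32],   # LLM correct
         [10, 27]]   # LLM wrong
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```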
Affiliation(s)
- Yasin Celal Gunes
- Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale
- Turay Cesur
- Department of Radiology, Mamak State Hospital, Ankara, Türkiye
5
Rao A, Mu A, Enichen E, Gupta D, Hall N, Koranteng E, Marks W, Senter-Zapata MJ, Whitehead DC, White BA, Saini S, Landman AB, Succi MD. A Future of Self-Directed Patient Internet Research: Large Language Model-Based Tools Versus Standard Search Engines. Ann Biomed Eng 2025; 53:1199-1208. [PMID: 40025252] [PMCID: PMC12123582] [DOI: 10.1007/s10439-025-03701-6]
Abstract
PURPOSE As generalist large language models (LLMs) become more commonplace, patients will inevitably increasingly turn to these tools instead of traditional search engines. Here, we evaluate publicly available LLM-based chatbots as tools for patient education through physician review of responses provided by Google, Bard, GPT-3.5 and GPT-4 to commonly searched queries about prevalent chronic health conditions in the United States. METHODS Five distinct commonly Google-searched queries were selected for (i) hypertension, (ii) hyperlipidemia, (iii) diabetes, (iv) anxiety, and (v) mood disorders and prompted into each model of interest. Responses were assessed by board-certified physicians for accuracy, comprehensiveness, and overall quality on a five-point Likert scale. The Flesch-Kincaid Grade Levels were calculated to assess readability. RESULTS GPT-3.5 (4.40 ± 0.48, 4.29 ± 0.43) and GPT-4 (4.35 ± 0.30, 4.24 ± 0.28) received higher ratings in comprehensiveness and quality than Bard (3.79 ± 0.36, 3.87 ± 0.32) and Google (1.87 ± 0.42, 2.11 ± 0.47), all p < 0.05. However, Bard (9.45 ± 1.35) and Google responses (9.92 ± 5.31) had a lower average Flesch-Kincaid Grade Level compared to GPT-3.5 (14.69 ± 1.57) and GPT-4 (12.88 ± 2.02), indicating greater readability. CONCLUSION This study suggests that publicly available LLM-based tools may provide patients with more accurate responses to queries on chronic health conditions than answers provided by Google search. These results provide support for the use of these tools in place of traditional search engines for health-related queries.
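The Flesch-Kincaid Grade Level reported above is a standard readability formula; the sketch below shows one common way to compute it in Python with the textstat package. The choice of textstat and the example sentence are assumptions for illustration, since the study does not name the tool it used.

```python
# Minimal readability sketch using the textstat package (an assumption;
# the study does not state which tool it used to compute grade levels).
import textstat

response = (
    "High blood pressure often has no symptoms, so regular home monitoring "
    "and lifestyle changes such as reducing salt intake are recommended."
)
# Flesch-Kincaid Grade Level: higher values indicate harder-to-read text.
print(textstat.flesch_kincaid_grade(response))
```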
Affiliation(s)
- Arya Rao
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Andrew Mu
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Elizabeth Enichen
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Dhruva Gupta
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Nathan Hall
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Erica Koranteng
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- William Marks
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Harvard Business School, Boston, MA, USA
- Michael J Senter-Zapata
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Mass General Brigham, Boston, MA, USA
- David C Whitehead
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Benjamin A White
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Sanjay Saini
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Adam B Landman
- Harvard Medical School, Boston, MA, USA
- Mass General Brigham, Boston, MA, USA
- Marc D Succi
- Harvard Medical School, Boston, MA, USA.
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA.
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA.
- Mass General Brigham, Boston, MA, USA.
6
Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. J Med Internet Res 2025; 27:e64486. [PMID: 40305085] [PMCID: PMC12079073] [DOI: 10.2196/64486]
Abstract
BACKGROUND Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, because medicine is highly specialized, complex, and specific, with extremely high accuracy requirements, controversy remains about whether LLMs can be used in the medical field. Many studies have evaluated the performance of various types of LLMs in medicine, but the conclusions are inconsistent. OBJECTIVE This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions, to provide high-level evidence for their future development and application in the medical field. METHODS In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were included and screened by reading published reports. The systematic review and NMA were conducted to compare the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian frequency theory methods. Indirect intercomparisons between programs were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher ranking of the corresponding LLM's accuracy. RESULTS The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. In terms of accuracy for the top 1 and top 3 diagnoses of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value for accuracy in triage and classification. CONCLUSIONS Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 and top 3 diagnoses. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous. This analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making in learning, diagnosis, and management of various clinical scenarios. TRIAL REGISTRATION PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245.
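As a brief illustration of the SUCRA metric used above, the sketch below derives a SUCRA value from a vector of rank probabilities. The probabilities are invented placeholders and the snippet is not part of the published analysis; it only shows the standard definition (the area under the cumulative ranking curve).

```python
# Hypothetical sketch: deriving a SUCRA value from rank probabilities
# for one model among k competitors (probabilities are invented).
import numpy as np

rank_probs = np.array([0.55, 0.25, 0.15, 0.05])  # P(rank 1), P(rank 2), ...
k = len(rank_probs)

cumulative = np.cumsum(rank_probs)[:-1]   # P(rank <= j) for j = 1..k-1
sucra = cumulative.sum() / (k - 1)        # surface under the cumulative ranking curve
print(f"SUCRA = {sucra:.3f}")             # values near 1 indicate likely top ranking
```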
Affiliation(s)
- Ling Wang
- Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China
- School of Pharmacy, Fujian Medical University, Fuzhou, China
- Jinglin Li
- School of Pharmacy, Fujian Medical University, Fuzhou, China
- Boyang Zhuang
- Fujian Center For Drug Evaluation and Monitoring, Fuzhou, China
- Shasha Huang
- School of Pharmacy, Fujian University of Traditional Chinese Medicine, Fuzhou, China
- Meilin Fang
- School of Pharmacy, Fujian Medical University, Fuzhou, China
- Cunze Wang
- School of Pharmacy, Fujian Medical University, Fuzhou, China
- Wen Li
- Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China
- Mohan Zhang
- School of Pharmacy, Fujian Medical University, Fuzhou, China
- Shurong Gong
- The Third Department of Critical Care Medicine, Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, Fuzhou, Fujian, China
7
Du Y, Ji C, Xu J, Wei M, Ren Y, Xia S, Zhou J. Performance of ChatGPT and Microsoft Copilot in Bing in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports. Sci Rep 2025; 15:14627. [PMID: 40287483] [PMCID: PMC12033324] [DOI: 10.1038/s41598-025-99268-2]
Abstract
To evaluate and compare the performance of the publicly available ChatGPT-3.5, ChatGPT-4.0, and Microsoft Copilot in Bing (Copilot) in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports. Twenty questions related to obstetric ultrasound were answered, and 110 obstetric ultrasound reports were analyzed, by ChatGPT-3.5, ChatGPT-4.0, and Copilot, with each question and report posed to them three times at different times. The accuracy and consistency of each response to the twenty questions and of each report analysis were evaluated and compared. In answering the twenty questions, both ChatGPT-3.5 and ChatGPT-4.0 outperformed Copilot in accuracy (95.0% vs. 80.0%) and consistency (90.0% and 85.0% vs. 75.0%), although these differences were not statistically significant. When analyzing obstetric ultrasound reports, ChatGPT-3.5 and ChatGPT-4.0 demonstrated superior accuracy compared with Copilot (P < 0.05), and all three showed high consistency and the ability to provide recommendations. Overall, the accuracy of ChatGPT-3.5, ChatGPT-4.0, and Copilot was 83.86%, 84.13%, and 77.51%, respectively, and their consistency was 87.30%, 93.65%, and 90.48%. These large language models (ChatGPT-3.5, ChatGPT-4.0, and Copilot) have the potential to assist clinical workflows by enhancing patient education and patient-clinician communication around common obstetric ultrasound issues. Given their inconsistent and sometimes inaccurate responses, along with cybersecurity concerns, physician supervision is crucial when using these models.
Affiliation(s)
- Yanran Du
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Chao Ji
- Department of Pediatrics, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Jiale Xu
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Minyan Wei
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Yunyun Ren
- Obstetrics and Gynecology Hospital of Fudan University, No.128, Shenyang Road, Shanghai, 200090, China.
- Shujun Xia
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China.
- JianQiao Zhou
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China.
8
Chen D, Avison K, Alnassar S, Huang RS, Raman S. Medical accuracy of artificial intelligence chatbots in oncology: a scoping review. Oncologist 2025; 30:oyaf038. [PMID: 40285677] [PMCID: PMC12032582] [DOI: 10.1093/oncolo/oyaf038]
Abstract
BACKGROUND Recent advances in large language models (LLMs) have enabled human-like natural language competency. Applied to oncology, LLMs have been proposed to serve as an information resource and to interpret vast amounts of data as a clinical decision-support tool to improve clinical outcomes. OBJECTIVE This review aims to describe the current status of the medical accuracy of oncology-related LLM applications and research trends for further areas of investigation. METHODS A scoping literature search was conducted on Ovid Medline for peer-reviewed studies published since 2000. We included primary research studies that evaluated the medical accuracy of a large language model applied in oncology settings. Study characteristics and primary outcomes of included studies were extracted to describe the landscape of oncology-related LLMs. RESULTS Sixty studies were included based on the inclusion and exclusion criteria. The majority of studies evaluated LLMs in oncology as a health information resource in question-answer style examinations (48%), followed by diagnosis (20%) and management (17%). The number of studies that evaluated the utility of fine-tuning and prompt-engineering LLMs increased over time from 2022 to 2024. Studies reported the advantages of LLMs as an accurate information resource, reduction of clinician workload, and improved accessibility and readability of clinical information, while noting disadvantages such as poor reliability, hallucinations, and the need for clinician oversight. DISCUSSION There is significant interest in the application of LLMs in clinical oncology, with a particular focus on their use as a medical information resource and clinical decision-support tool. However, further research is needed to validate these tools on external hold-out datasets for generalizability and to improve medical accuracy across diverse clinical scenarios, underscoring the need for clinician supervision of these tools.
Affiliation(s)
- David Chen
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Kate Avison
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Saif Alnassar
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Ryan S Huang
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Srinivas Raman
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON M5T 1P5, Canada
- Department of Radiation Oncology, BC Cancer, Vancouver, BC V5Z 1G1, Canada
- Division of Radiation Oncology, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
9
Portilla ND, Garcia-Font M, Nagendrababu V, Abbott PV, Sanchez JAG, Abella F. Accuracy and Consistency of Gemini Responses Regarding the Management of Traumatized Permanent Teeth. Dent Traumatol 2025; 41:171-177. [PMID: 39460511] [DOI: 10.1111/edt.13004]
Abstract
BACKGROUND The aim of this cross-sectional observational analytical study was to assess the accuracy and consistency of responses provided by Google Gemini (GG), a free-access high-performance multimodal large language model, to questions related to the European Society of Endodontology position statement on the management of traumatized permanent teeth (MTPT). MATERIALS AND METHODS Three academic endodontists developed a set of 99 yes/no questions covering all areas of the MTPT. Nine general dentists and 22 endodontic specialists evaluated these questions for clarity and comprehension through an iterative process. Two academic dental trauma experts categorized the knowledge required to answer each question into three levels. The three academic endodontists submitted the 99 questions to GG, resulting in 297 responses, which were then assessed for accuracy and consistency. Accuracy was evaluated using the Wald binomial method, while the consistency of GG responses was assessed using the Fleiss kappa coefficient with a 95% confidence interval. A chi-squared test at the 5% significance level was used to evaluate the influence of the question knowledge level on accuracy and consistency. RESULTS The responses generated by Gemini showed an overall moderate accuracy of 80.81%, with no significant differences between the responses submitted by the three academic endodontists. Overall consistency was high (95.96%), with no significant differences between GG responses across the three accounts. The analysis also revealed no correlation between question knowledge level and accuracy or consistency. CONCLUSIONS These results could significantly affect the potential use of Gemini as a free-access source of information for clinicians on the MTPT.
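The two statistics named above are both standard; the sketch below shows how a Wald binomial interval for accuracy and a Fleiss kappa for across-account consistency could be computed with statsmodels. The counts and coded answers are placeholders, not the study's data.

```python
# Illustrative sketch with placeholder data: Wald binomial confidence
# interval for accuracy and Fleiss' kappa for consistency across the
# three accounts that submitted the same questions.
import numpy as np
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

correct, total = 240, 297  # e.g., roughly 80.8% overall accuracy
low, high = proportion_confint(correct, total, alpha=0.05, method="normal")
print(f"Wald 95% CI for accuracy: {low:.3f}-{high:.3f}")

# each row is one question; columns are the answers (1 = yes, 0 = no)
# returned to the three accounts
answers = np.array([[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1]])
counts, _ = aggregate_raters(answers)
print(f"Fleiss kappa: {fleiss_kappa(counts):.3f}")
```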
Affiliation(s)
- Nicolas Dufey Portilla
- Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
- Department of Endodontics, School of Dentistry, Universidad Andres Bello, Viña del Mar, Chile
- Marc Garcia-Font
- Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
- Venkateshbabu Nagendrababu
- Department of Preventive and Restorative Dentistry, College of Dental Medicine, University of Sharjah, Sharjah, UAE
- Paul V Abbott
- UWA Dental School, The University of Western Australia, Perth, Western Australia, Australia
- Francesc Abella
- Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
10
Halfmann MC, Mildenberger P, Jorg T. [Artificial intelligence in radiology: Literature overview and reading recommendations]. Radiologie (Heidelb) 2025; 65:266-270. [PMID: 39904811] [DOI: 10.1007/s00117-025-01419-z]
Abstract
BACKGROUND Owing to the ongoing rapid advancement of artificial intelligence (AI), including large language models (LLMs), radiologists will soon face the challenge of responsibly integrating these models into clinical practice. OBJECTIVES The aim of this work is to provide an overview of current developments regarding LLMs, potential applications in radiology, and their (future) relevance and limitations. MATERIALS AND METHODS This review analyzes publications on LLMs for specific applications in medicine and radiology. Additionally, literature related to the challenges of clinical LLM use was reviewed and summarized. RESULTS In addition to a general overview of the current literature on radiological applications of LLMs, several particularly noteworthy studies on the subject are recommended. CONCLUSIONS To facilitate the forthcoming clinical integration of LLMs, radiologists need to engage with the topic, understand the various application areas, and be aware of potential limitations so that they can address challenges related to patient safety, ethics, and data protection.
Affiliation(s)
- Moritz C Halfmann
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland
- Peter Mildenberger
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland
- Tobias Jorg
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland.
11
Zhou Z, Qin P, Cheng X, Shao M, Ren Z, Zhao Y, Li Q, Liu L. ChatGPT in Oncology Diagnosis and Treatment: Applications, Legal and Ethical Challenges. Curr Oncol Rep 2025; 27:336-354. [PMID: 39998782] [DOI: 10.1007/s11912-025-01649-3]
Abstract
PURPOSE OF REVIEW This study aims to systematically review the trajectory of artificial intelligence (AI) development in the medical field, with a particular emphasis on ChatGPT, a cutting-edge tool that is transforming oncology diagnosis and treatment practices. RECENT FINDINGS Recent advancements have demonstrated that ChatGPT can be effectively utilized in various areas, including collecting medical histories, conducting radiological and pathological diagnoses, generating electronic medical records (EMRs), providing nutritional support, participating in multidisciplinary teams (MDTs), and formulating personalized, multidisciplinary treatment plans. However, significant challenges related to data privacy and legal issues need to be addressed for the safe and effective integration of ChatGPT into clinical practice. ChatGPT, an emerging AI technology, opens up new avenues and viewpoints for oncology diagnosis and treatment. If the current technological and legal challenges can be overcome, ChatGPT is expected to play a more significant role in oncology diagnosis and treatment in the future, providing better treatment options and improving the quality of medical services.
Affiliation(s)
- Zihan Zhou
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Peng Qin
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Xi Cheng
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Maoxuan Shao
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Zhaozheng Ren
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Yiting Zhao
- Stomatological College of Nanjing Medical University, Nanjing, 211166, China
- Qiunuo Li
- The First Clinical Medical College of Nanjing Medical University, Nanjing, 211166, China
- Lingxiang Liu
- Department of Oncology, The First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Nanjing, 210029, Jiangsu, China.
12
Carlà MM, Crincoli E, Rizzo S. Retinal Imaging Analysis Performed by ChatGPT-4o and Gemini Advanced: The Turning Point of the Revolution? Retina 2025; 45:694-702. [PMID: 39715322] [DOI: 10.1097/iae.0000000000004351]
Abstract
PURPOSE To assess the diagnostic capabilities of the most recent chatbot releases, GPT-4o and Gemini Advanced, when facing different retinal diseases. METHODS Exploratory analysis of 50 cases with different surgical (n = 27) and medical (n = 23) retinal pathologies, whose optical coherence tomography/angiography scans were dragged into the ChatGPT and Gemini interfaces. The authors then asked "Please describe this image" and classified the diagnosis as: 1) correct; 2) partially correct; 3) wrong; 4) unable to assess exam type; or 5) diagnosis not given. RESULTS ChatGPT indicated the correct diagnosis in 31 of 50 cases (62%), significantly more than Gemini Advanced with 16 of 50 cases (32%) (P = 0.0048). In 24% of cases, Gemini Advanced was not able to produce any answer, stating "That's not something I'm able to do yet." For both chatbots, the primary misdiagnosis was macular edema, given erroneously in 16% and 14% of cases, respectively. ChatGPT-4o showed higher rates of correct diagnoses in both surgical (52% vs. 30%) and medical retina (78% vs. 43%). Notably, when optical coherence tomography angiography scans were presented without the corresponding structural image, in no case was Gemini able to recognize them, confusing the images with artworks. CONCLUSION ChatGPT-4o outperformed Gemini Advanced in diagnostic accuracy when facing optical coherence tomography/angiography images, even if the range of diagnoses is still limited.
Affiliation(s)
- Matteo Mario Carlà
- Ophthalmology Department, "Fondazione Policlinico Universitario A. Gemelli, IRCCS", Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Emanuele Crincoli
- Ophthalmology Department, "Fondazione Policlinico Universitario A. Gemelli, IRCCS", Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Stanislao Rizzo
- Ophthalmology Department, "Fondazione Policlinico Universitario A. Gemelli, IRCCS", Rome, Italy
- Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy
- Consiglio Nazionale Delle Ricerche, Istituto di Neuroscienze, Pisa, Italy
13
Mohamed KS, Yu A, Schroen CA, Duey A, Hong J, Yu R, Etigunta S, Kator J, Rhee HS, Hausman MR. Comparing AAOS appropriate use criteria with ChatGPT-4o recommendations on treating distal radius fractures. Hand Surg Rehabil 2025; 44:102122. [PMID: 40081807] [DOI: 10.1016/j.hansur.2025.102122]
Abstract
INTRODUCTION The American Academy of Orthopaedic Surgeons (AAOS) developed appropriate use criteria (AUC) to guide treatment decisions for distal radius fractures based on expert consensus. This study aims to evaluate the accuracy of Chat Generative Pre-trained Transformer-4o (ChatGPT-4o) by comparing its appropriateness scores for distal radius fracture treatment with those from the AUC. METHODS The AUC patient scenarios were categorized by factors such as fracture type (AO/OTA classification), mechanism of injury, pre-injury activity level, patient health (ASA 1-4), and associated injuries. Treatment options included percutaneous pinning, spanning external fixation, volar locking plates, dorsal plates, and immobilization methods, among others. Orthopedic surgeons assigned appropriateness scores for each treatment (1-3 = "Rarely Appropriate," 4-6 = "May Be Appropriate," and 7-9 = "Appropriate"). ChatGPT-4o was prompted with the same patient scenarios and asked to assign scores. Differences between AAOS and ChatGPT-4o ratings were used to calculate the mean error, mean absolute error, and mean squared error. Statistical significance was assessed using Spearman correlation, and appropriateness scores were grouped into categories to determine the percentage overlap between the two sources. RESULTS A total of 240 patient scenarios and 2160 paired treatment scores were analyzed. The mean error for treatment options ranged from 0.6 for the volar locking plate to -2.9 for dorsal plating. Correlation analysis revealed significant positive associations for the dorsal spanning bridge (0.43, P < 0.001) and spanning external fixation (0.4, P < 0.001). The percentage overlap between AAOS and ChatGPT-4o appropriateness categories varied, with 99.17% agreement for immobilization without reduction, 90.42% for volar locking plates, and only 15% for dorsal plating. CONCLUSION ChatGPT-4o does not consistently align with the appropriate use criteria in determining appropriate management of distal radius fractures. While there was moderate concordance for certain treatments, ChatGPT-4o tended to favor more conservative approaches, raising concerns about the reliability of AI-generated recommendations for medical advice and clinical decision-making.
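To make the error metrics above concrete, the sketch below computes the mean error, mean absolute error, mean squared error, a rank correlation, and the three-category overlap for a pair of score vectors. The scores are invented placeholders, not the AUC panel's or ChatGPT-4o's actual ratings.

```python
# Hypothetical sketch of the comparison metrics (scores are invented,
# not the AUC panel's or ChatGPT-4o's actual ratings on the 1-9 scale).
import numpy as np
from scipy.stats import spearmanr

aaos_scores = np.array([8, 7, 3, 5, 9, 2, 6, 4])
gpt_scores = np.array([7, 6, 2, 6, 8, 3, 5, 2])

diff = gpt_scores - aaos_scores
print("mean error:", diff.mean())
print("mean absolute error:", np.abs(diff).mean())
print("mean squared error:", (diff ** 2).mean())
print("rank correlation:", spearmanr(aaos_scores, gpt_scores))

# bin into Rarely (1-3), May Be (4-6), Appropriate (7-9) and measure overlap
to_category = lambda s: np.digitize(s, bins=[3.5, 6.5])
overlap = (to_category(aaos_scores) == to_category(gpt_scores)).mean() * 100
print(f"category overlap: {overlap:.1f}%")
```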
Affiliation(s)
- Kareem S Mohamed
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Alexander Yu
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Christoph A Schroen
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States; Department of Hand-, Plastic and Reconstructive Surgery, BG Trauma Center Ludwigshafen, Heidelberg University, Heidelberg, Germany.
- Akiro Duey
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- James Hong
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Ryan Yu
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Suhas Etigunta
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Jamie Kator
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Hannah S Rhee
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Michael R Hausman
- Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, United States
14
Knee CJ, Campbell RJ, Sivakumar BS, Symes MJ. An Assessment of the Accuracy and Consistency of ChatGPT in the Management of Midshaft Clavicle Fractures. Cureus 2025; 17:e81906. [PMID: 40342470] [PMCID: PMC12059606] [DOI: 10.7759/cureus.81906]
Abstract
Background Midshaft clavicle fractures are common orthopaedic injuries with no consensus on optimal management. Large language models (LLMs) such as ChatGPT (OpenAI, San Francisco, USA) present a novel tool for patient education and clinical decision-making. This study aimed to evaluate the accuracy and consistency of ChatGPT's responses to patient-focused and clinical decision-making questions regarding this injury. Methods ChatGPT-4o mini was prompted three times with 14 patient-focused and orthopaedic clinical decision-making questions. References were requested for each response. Response accuracy was graded as: (I) comprehensive; (II) correct but inadequate; (III) mixed with correct and incorrect information; or (IV) completely incorrect. Two consultant and two trainee orthopaedic surgeons evaluated the accuracy and consistency of responses. References provided by ChatGPT were evaluated for accuracy. Results All 42 responses were graded as (III), indicating a mix of correct and incorrect information, with 78.6% consistency across the responses. Of the 128 references provided, 0.8% were correct, 10.9% were incorrect, and 88.3% were fabricated. Only 3.1% of references accurately reflected the cited conclusions. Conclusion ChatGPT demonstrates limitations in accuracy and consistency when answering patient-focused queries or aiding in orthopaedic clinical decision-making for midshaft clavicle fractures. Caution is advised before integrating ChatGPT into clinical workflows for patients or orthopaedic clinicians.
Affiliation(s)
- Christopha J Knee
- Department of Orthopaedics and Trauma Surgery, Royal North Shore Hospital, Sydney, AUS
- Ryan J Campbell
- Department of Orthopaedics and Trauma Surgery, Royal North Shore Hospital, Sydney, AUS
- Brahman S Sivakumar
- Department of Hand and Peripheral Nerve Surgery, Royal North Shore Hospital, Sydney, AUS
- Michael J Symes
- Department of Orthopaedics and Trauma Surgery, Royal North Shore Hospital, Sydney, AUS
15
Tanaka OM, Gasparello GG, Mota-Júnior SL, Bark MJ, Rozyscki JDAA, Wolanski RB. Effectiveness of AI-generated orthodontic treatment plans compared to expert orthodontist recommendations: a cross-sectional pilot study. Dental Press J Orthod 2025; 30:e2524186. [PMID: 40136111] [PMCID: PMC11939423] [DOI: 10.1590/2177-6709.30.1.e2524186.oar]
Abstract
INTRODUCTION Artificial intelligence (AI) has become a prominent focus in orthodontics. OBJECTIVE This study aimed to compare treatment plans generated by AI platforms (ChatGPT, Google Bard, Microsoft Bing) with those formulated by an experienced orthodontist. METHODS This observational cross-sectional pilot study evaluated the effectiveness of AI-powered platforms in creating orthodontic treatment plans, using a clinical case treated by an experienced orthodontist as a benchmark. A clinical case was selected, and after informed consent was obtained, detailed case information was presented to ChatGPT-3.5, Microsoft Bing Copilot, and Google Bard Gemini for treatment planning. The AI-generated plans, along with the orthodontist's plan, were evaluated by 34 orthodontists using a questionnaire that included Likert scale and Visual Analog Scale (VAS) items. Statistical analysis was performed to compare the levels of agreement with the proposed treatment plans. RESULTS Orthodontists exhibited significantly higher levels of agreement with the treatment plan proposed by the orthodontist than with those generated by the AI platforms (p < 0.001). Both Likert scale and VAS scores indicated greater confidence in the orthodontist's expertise in formulating treatment plans. No significant differences were found among the AI platforms, although Google Bard received the lowest mean scores. CONCLUSIONS Orthodontists demonstrated a higher level of acceptance of treatment plans formulated by human counterparts than of those generated by AI platforms. While AI offers significant contributions, the clinical judgment and experience of orthodontists remain essential for thorough and effective treatment planning in orthodontics.
Affiliation(s)
- Orlando Motohiro Tanaka
- Center for Advanced Dental Education at Saint Louis University (Saint Louis, USA)
- Pontifícia Universidade Católica do Paraná, Medicine and Life Science School (Curitiba/PR, Brazil)
- Gil Guilherme Gasparello
- Pontifícia Universidade Católica do Paraná, Medicine and Life Science School (Curitiba/PR, Brazil)
- Mohamad Jamal Bark
- Pontifícia Universidade Católica do Paraná, Medicine and Life Science School (Curitiba/PR, Brazil)
- Rafael Bordin Wolanski
- Pontifícia Universidade Católica do Paraná, Medicine and Life Science School (Curitiba/PR, Brazil)
16
Gumilar KE, Wardhana MP, Akbar MIA, Putra AS, Banjarnahor DPP, Mulyana RS, Fatati I, Yu ZY, Hsu YC, Dachlan EG, Lu CH, Liao LN, Tan M. Artificial intelligence-large language models (AI-LLMs) for reliable and accurate cardiotocography (CTG) interpretation in obstetric practice. Comput Struct Biotechnol J 2025; 27:1140-1147. [PMID: 40206348] [PMCID: PMC11981782] [DOI: 10.1016/j.csbj.2025.03.026]
Abstract
Background Accurate cardiotocography (CTG) interpretation is vital for monitoring fetal well-being during pregnancy and labor. Advanced artificial intelligence (AI) tools such as AI-large language models (AI-LLMs) may enhance the accuracy of CTG interpretation, but their potential has not been extensively evaluated. Objective This study aimed to assess the performance of three AI-LLMs (ChatGPT-4o, Gemini Advanced, and Copilot) in CTG image interpretation, compare their results with those of junior human doctors (JHDs) and senior human doctors (SHDs), and evaluate their reliability in clinical decision-making. Study design Seven CTG images were interpreted by the three AI-LLMs, five SHDs, and five JHDs, with the evaluations scored by five blinded maternal-fetal medicine experts using a Likert scale for five parameters (relevance, clarity, depth, focus, and coherence). The homogeneity of the expert ratings and the group performances were statistically compared. Results ChatGPT-4o scored 77.86, outperforming Gemini Advanced (57.14), Copilot (47.29), and the JHDs (61.57). Its performance closely approached that of the SHDs (80.43), with no statistically significant difference between the two (p > 0.05). ChatGPT-4o excelled in the depth parameter and was only marginally inferior to the SHDs on the other parameters. Conclusion ChatGPT-4o demonstrated superior performance among the AI-LLMs, surpassed the JHDs in CTG interpretation, and closely matched the performance level of the SHDs. AI-LLMs, particularly ChatGPT-4o, are promising tools for assisting obstetricians, improving diagnostic accuracy, and enhancing obstetric patient care.
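The abstract does not name the statistical test used for the group comparison, so the sketch below shows one plausible approach, a Kruskal-Wallis test on expert Likert ratings; both the choice of test and the rating values are assumptions made for illustration.

```python
# Sketch of one plausible group comparison of expert Likert ratings
# (the test choice is an assumption and the ratings are invented).
from scipy.stats import kruskal

chatgpt_4o_ratings = [4, 5, 4, 4, 5, 4, 3]
gemini_adv_ratings = [3, 3, 4, 2, 3, 3, 4]
senior_doctor_ratings = [5, 4, 5, 4, 5, 5, 4]

h_stat, p_value = kruskal(chatgpt_4o_ratings, gemini_adv_ratings, senior_doctor_ratings)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```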
Affiliation(s)
- Khanisyah Erza Gumilar
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
- Department of Obstetrics and Gynecology, Universitas Airlangga Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
- Manggala Pasca Wardhana
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
- Muhammad Ilham Aldika Akbar
- Department of Obstetrics and Gynecology, Universitas Airlangga Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
- Agung Sunarko Putra
- Department of Obstetrics and Gynecology, Dr. Ramelan Naval Hospital, Surabaya, Indonesia
- Ryan Saktika Mulyana
- Department of Obstetrics and Gynecology, Udayana University Hospital, Denpasar, Indonesia
- Ita Fatati
- Department of Obstetrics and Gynecology, Bandung Kiwari General Hospital, Bandung, Indonesia
- Zih-Ying Yu
- Department of Public Health, China Medical University, Taichung, Taiwan
- Yu-Cheng Hsu
- Department of Public Health, China Medical University, Taichung, Taiwan
- School of Chinese Medicine, China Medical University, Taichung, Taiwan
- Erry Gumilar Dachlan
- Department of Obstetrics and Gynecology, Universitas Airlangga Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
- Chien-Hsing Lu
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, Taichung, Taiwan
- Li-Na Liao
- Department of Public Health, China Medical University, Taichung, Taiwan
- Ming Tan
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
- Institute of Biochemistry and Molecular Biology and Research Center for Cancer Biology, China Medical University, Taichung, Taiwan
17
Menz BD, Modi ND, Abuhelwa AY, Ruanglertboon W, Vitry A, Gao Y, Li LX, Chhetri R, Chu B, Bacchi S, Kichenadasse G, Shahnam A, Rowland A, Sorich MJ, Hopkins AM. Generative AI chatbots for reliable cancer information: Evaluating web-search, multilingual, and reference capabilities of emerging large language models. Eur J Cancer 2025; 218:115274. [PMID: 39922126] [DOI: 10.1016/j.ejca.2025.115274]
Abstract
Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring that they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. Overall, 48% (162/336) of responses included valid references, but 39% of the English references were .com links, reflecting quality concerns. English responses frequently exceeded an eighth-grade reading level, and many non-English outputs were also complex. These findings reflect substantial progress over the past 2 years but reveal persistent gaps in multilingual accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure that LLMs safely support global health information needs and meet online information standards.
Affiliation(s)
- Bradley D Menz
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
- Natansh D Modi
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
- Ahmad Y Abuhelwa
- Department of Pharmacy Practice and Pharmacotherapeutics, College of Pharmacy, University of Sharjah, Sharjah, United Arab Emirates
- Warit Ruanglertboon
- Division of Health and Applied Sciences, Prince of Songkla University, Songkhla, Thailand; Research Center in Mathematics and Statistics with Applications, Discipline of Statistics, Division of Computational Science, Faculty of Science, Prince of Songkla University, Songkhla, Thailand
- Agnes Vitry
- University of South Australia, Clinical and Health Sciences, Adelaide, Australia
- Yuan Gao
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
- Lee X Li
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
- Rakchha Chhetri
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
- Bianca Chu
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
- Stephen Bacchi
- Department of Neurology and the Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02138, USA
- Ganessan Kichenadasse
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia; Flinders Centre for Innovation in Cancer, Department of Medical Oncology, Flinders Medical Centre, Flinders University, Bedford Park, South Australia, Australia
| | - Adel Shahnam
- Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Andrew Rowland
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Michael J Sorich
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Ashley M Hopkins
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia.
| |
Collapse
|
18
|
Mavrych V, Ganguly P, Bolgova O. Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis. Clin Anat 2025; 38:200-210. [PMID: 39573871 DOI: 10.1002/ca.24244] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2024] [Revised: 10/24/2024] [Accepted: 11/04/2024] [Indexed: 04/27/2025]
Abstract
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of six different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. Selected chatbots were tested, answering 50 USMLE-style MCQs. The questions were randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts to answer each set of questions by the chatbots were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was provided by ChatGPT-4, which answered 60.5% ± 1.9% of questions accurately, then Copilot (42.0% ± 0.0%) and ChatGPT-3.5 (41.0% ± 5.3%), followed by ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-Turbo, PaLM2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of the three randomly selected topics-anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa-and three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then for Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, Copilot, followed by Google PaLM 2; Copilot had the lowest grade. Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
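For readers less familiar with the repeated-attempt scoring described above, the short Python sketch below shows how per-attempt accuracy could be summarized as mean ± SD and how two chatbots could be compared on pooled counts. The counts are hypothetical placeholders, and the two-proportion chi-square is an assumed choice of test, not necessarily the one used in the study.

```python
# A minimal sketch (not the authors' code) of aggregating repeated-attempt
# accuracy and comparing two chatbots on the same 50-question set.
# The counts below are hypothetical placeholders, not the study data.
import numpy as np
from scipy.stats import chi2_contingency

N_QUESTIONS = 50

# Correct-answer counts per attempt (5 attempts each), hypothetical values.
gpt4_correct = np.array([31, 30, 29, 31, 30])
bard_correct = np.array([17, 16, 18, 16, 17])

for name, correct in [("GPT-4", gpt4_correct), ("Bard", bard_correct)]:
    acc = correct / N_QUESTIONS * 100
    print(f"{name}: {acc.mean():.1f}% ± {acc.std(ddof=1):.1f}%")

# Two-proportion comparison on pooled correct/incorrect counts.
table = np.array([
    [gpt4_correct.sum(), 5 * N_QUESTIONS - gpt4_correct.sum()],
    [bard_correct.sum(), 5 * N_QUESTIONS - bard_correct.sum()],
])
chi2, p, _, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```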
Collapse
Affiliation(s)
- Volodymyr Mavrych
- College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia
| | - Paul Ganguly
- College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia
| | - Olena Bolgova
- College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia
| |
Collapse
|
19
|
Lee KL, Kessler DA, Caglic I, Kuo YH, Shaida N, Barrett T. Assessing the performance of ChatGPT and Bard/Gemini against radiologists for Prostate Imaging-Reporting and Data System classification based on prostate multiparametric MRI text reports. Br J Radiol 2025; 98:368-374. [PMID: 39535870 PMCID: PMC11840166 DOI: 10.1093/bjr/tqae236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 10/10/2024] [Accepted: 11/10/2024] [Indexed: 11/16/2024] Open
Abstract
OBJECTIVES Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign Prostate Imaging-Reporting and Data System (PI-RADS) categories based on clinical text reports. METHODS One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by 2 uroradiologists, ChatGPT-3.5 (GPT-3.5), ChatGPT-4o mini (GPT-4), Bard, and Gemini. Original report classifications were considered definitive. RESULTS Out of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 PI-RADS 3, 19 PI-RADS 4, and 20 PI-RADS 5. Radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated versions of LLMs increased to 83% (GPT-4) and 79% (Gemini), respectively. In low suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94, 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77, 0.94, respectively), whereas for high probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95, 0.98, respectively) outperformed Bard and Gemini (F1: 0.71, 0.87, respectively). Bard assigned a non-existent PI-RADS 6 "hallucination" for 2 patients. Inter-reader agreements (κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini were 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively. CONCLUSIONS Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared to their predecessors. ADVANCES IN KNOWLEDGE This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.
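The F1 scores and inter-reader kappa values quoted above can be reproduced from raw category assignments with standard scikit-learn calls; the sketch below uses invented PI-RADS labels for ten reports and is only an illustration of the metrics, not the study's code.

```python
# A minimal sketch (assumed workflow, not the authors' code) of scoring
# PI-RADS assignments against the original reports with Cohen's kappa and
# a binary F1 for the "low suspicion" group. Labels here are illustrative.
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical PI-RADS categories for 10 reports (original vs. one LLM).
original = [2, 2, 1, 3, 4, 5, 2, 4, 5, 3]
llm      = [2, 1, 1, 3, 4, 5, 2, 5, 5, 4]

kappa = cohen_kappa_score(original, llm)           # chance-corrected agreement
f1_low = f1_score([y <= 2 for y in original],      # "low suspicion" (PI-RADS 1-2)
                  [y <= 2 for y in llm])
print(f"kappa = {kappa:.2f}, F1 (PI-RADS 1-2) = {f1_low:.2f}")
```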
Collapse
Affiliation(s)
- Kang-Lung Lee
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Taipei Veterans General Hospital, Taipei 112, Taiwan
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 112, Taiwan
| | - Dimitri A Kessler
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| | - Iztok Caglic
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| | - Yi-Hsin Kuo
- Department of Radiology, Taipei Veterans General Hospital, Taipei 112, Taiwan
| | - Nadeem Shaida
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| | - Tristan Barrett
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| |
Collapse
|
20
|
Liu W, Wei H, Xiang L, Liu Y, Wang C, Hua Z. Bridging the Gap in Neonatal Care: Evaluating AI Chatbots for Chronic Neonatal Lung Disease and Home Oxygen Therapy Management. Pediatr Pulmonol 2025; 60:e71020. [PMID: 40042139 DOI: 10.1002/ppul.71020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 01/27/2025] [Accepted: 02/19/2025] [Indexed: 05/12/2025]
Abstract
OBJECTIVE To evaluate the accuracy and comprehensiveness of eight free, publicly available large language model (LLM) chatbots in addressing common questions related to chronic neonatal lung disease (CNLD) and home oxygen therapy (HOT). STUDY DESIGN Twenty CNLD and HOT-related questions were curated across nine domains. Responses from ChatGPT-3.5, Google Bard, Bing Chat, Claude 3.5 Sonnet, ERNIE Bot 3.5, and GLM-4 were generated and evaluated by three experienced neonatologists using Likert scales for accuracy and comprehensiveness. Updated models (ChatGPT-4o mini and Gemini 2.0 Flash Experimental) were incorporated to assess rapid technological advancement. Statistical analyses included ANOVA, Kruskal-Wallis tests, and intraclass correlation coefficients. RESULTS Bing Chat and Claude 3.5 Sonnet demonstrated superior performance, with the highest mean accuracy scores (5.78 ± 0.48 and 5.75 ± 0.54, respectively) and competence scores (2.65 ± 0.58 and 2.80 ± 0.41, respectively). In subsequent testing, Gemini 2.0 Flash Experimental and ChatGPT-4o mini achieved comparably high performance. Performance varied across domains, with all models excelling in "equipment and safety protocols" and "caregiver support." ERNIE Bot 3.5 and GLM-4 showed self-correction capabilities when prompted. CONCLUSIONS LLMs show promise in providing accurate CNLD/HOT information. However, performance variability and the risk of misinformation necessitate expert oversight and continued refinement before widespread clinical implementation.
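The statistics named above (Kruskal-Wallis, intraclass correlation) can be illustrated with a small hedged sketch; the ratings, chatbot labels, and the use of the pingouin package below are assumptions for demonstration, not the authors' pipeline.

```python
# A minimal sketch (assumptions, not the study code) of comparing Likert
# accuracy ratings across chatbots with a Kruskal-Wallis test and checking
# rater consistency with an intraclass correlation via pingouin.
import pandas as pd
from scipy.stats import kruskal
import pingouin as pg

# Hypothetical 1-6 accuracy ratings from 3 raters for 2 chatbots x 3 questions.
ratings = pd.DataFrame({
    "question": ["q1", "q2", "q3"] * 6,
    "chatbot":  ["A"] * 9 + ["B"] * 9,
    "rater":    (["r1"] * 3 + ["r2"] * 3 + ["r3"] * 3) * 2,
    "score":    [6, 5, 6, 6, 5, 5, 6, 6, 5,
                 4, 3, 4, 4, 4, 3, 5, 4, 4],
})

groups = [g["score"].values for _, g in ratings.groupby("chatbot")]
h, p = kruskal(*groups)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")

# ICC: targets = items rated, raters = raters, ratings = scores (one chatbot).
icc = pg.intraclass_corr(data=ratings[ratings.chatbot == "A"],
                         targets="question", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```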
Collapse
Affiliation(s)
- Weiqin Liu
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
- Changdu People's Hospital of Xizang, Xizang, China
| | - Hong Wei
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Lingling Xiang
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Yin Liu
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
- Special Care Nursery, Port Moresby General Hospital, Port Moresby, Papua New Guinea
| | - Chunyi Wang
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Ziyu Hua
- Department of Neonatology, National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, Children's Hospital of Chongqing Medical University, Chongqing, China
- Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| |
Collapse
|
21
|
Guo S, Li R, Li G, Chen W, Huang J, He L, Ma Y, Wang L, Zheng H, Tian C, Zhao Y, Pan X, Wan H, Liu D, Li Z, Lei J. Comparing ChatGPT's and Surgeon's Responses to Thyroid-related Questions From Patients. J Clin Endocrinol Metab 2025; 110:e841-e850. [PMID: 38597169 DOI: 10.1210/clinem/dgae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/12/2024] [Revised: 04/03/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
CONTEXT For some common thyroid-related conditions with high prevalence and long follow-up times, ChatGPT can be used to respond to common thyroid-related questions. OBJECTIVE In this cross-sectional study, we assessed the ability of ChatGPT (version GPT-4.0) to provide accurate, comprehensive, compassionate, and satisfactory responses to common thyroid-related questions. METHODS First, we obtained 28 thyroid-related questions from the Huayitong app, which, together with the 2 interfering questions, eventually formed 30 questions. Then, these questions were responded to by ChatGPT (on July 19, 2023), a junior specialist, and a senior specialist (on July 20, 2023) separately. Finally, 26 patients and 11 thyroid surgeons evaluated those responses on 4 dimensions: accuracy, comprehensiveness, compassion, and satisfaction. RESULTS Among the 30 questions and responses, ChatGPT's speed of response was faster than that of the junior specialist (8.69 [7.53-9.48] vs 4.33 [4.05-4.60]; P < .001) and the senior specialist (8.69 [7.53-9.48] vs 4.22 [3.36-4.76]; P < .001). The word count of ChatGPT's responses was greater than that of both the junior specialist (341.50 [301.00-384.25] vs 74.50 [51.75-84.75]; P < .001) and the senior specialist (341.50 [301.00-384.25] vs 104.00 [63.75-177.75]; P < .001). ChatGPT received higher scores than the junior specialist and senior specialist in terms of accuracy, comprehensiveness, compassion, and satisfaction in responding to common thyroid-related questions. CONCLUSION ChatGPT performed better than a junior specialist and a senior specialist in answering common thyroid-related questions, but further research is needed to validate the logical ability of ChatGPT for complex thyroid questions.
Collapse
Affiliation(s)
- Siyin Guo
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Ruicen Li
- Health Management Center, General Practice Medical Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Genpeng Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Wenjie Chen
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Jing Huang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Linye He
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Yu Ma
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Liying Wang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Hongping Zheng
- Department of Thyroid Surgery, General Surgery Ward 7, The First Hospital of Lanzhou University, Lanzhou, Gansu 730000, China
| | - Chunxiang Tian
- Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan 610031, China
| | - Yatong Zhao
- Thyroid Surgery, Zhengzhou Central Hospital Affiliated of Zhengzhou University, Zhengzhou, Henan 450007, China
| | - Xinmin Pan
- Department of Thyroid Surgery, General Surgery III, Gansu Provincial Hospital, Lanzhou, Gansu 730000, China
| | - Hongxing Wan
- Department of Oncology, Sanya People's Hospital, Sanya, Hainan 572000, China
| | - Dasheng Liu
- Department of Vascular Thyroid Surgery, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510120, China
| | - Zhihui Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Jianyong Lei
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| |
Collapse
|
22
|
Megafu M, Guerrero O, Yendluri A, Parsons BO, Galatz LM, Li X, Kelly JD, Parisien RL. ChatGPT and Gemini Are Not Consistently Concordant With the 2020 American Academy of Orthopaedic Surgeons Clinical Practice Guidelines When Evaluating Rotator Cuff Injury. Arthroscopy 2025:S0749-8063(25)00057-X. [PMID: 39914605 DOI: 10.1016/j.arthro.2025.01.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/08/2024] [Revised: 01/02/2025] [Accepted: 01/18/2025] [Indexed: 03/04/2025]
Abstract
PURPOSE To evaluate the accuracy of suggestions given by ChatGPT and Gemini (previously known as "Bard"), 2 widely used publicly available large language models, to evaluate the management of rotator cuff injuries. METHODS The 2020 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) were the basis for determining recommended and non-recommended treatments in this study. ChatGPT and Gemini were queried on 16 treatments based on these guidelines examining rotator cuff interventions. The responses were categorized as "concordant" or "discordant" with the AAOS CPGs. The Cohen κ coefficient was calculated to assess inter-rater reliability. RESULTS ChatGPT and Gemini showed concordance with the AAOS CPGs for 13 of the 16 treatments queried (81%) and 12 of the 16 treatments queried (75%), respectively. ChatGPT provided discordant responses with the AAOS CPGs for 3 treatments (19%), whereas Gemini provided discordant responses for 4 treatments (25%). Assessment of inter-rater reliability showed a Cohen κ coefficient of 0.98, signifying agreement between the raters in classifying the responses of ChatGPT and Gemini to the AAOS CPGs as being concordant or discordant. CONCLUSIONS ChatGPT and Gemini do not consistently provide responses that align with the AAOS CPGs. CLINICAL RELEVANCE This study provides evidence that cautions patients not to rely solely on artificial intelligence for recommendations about rotator cuff injuries.
Collapse
Affiliation(s)
- Michael Megafu
- Department of Orthopedic Surgery, University of Connecticut, Farmington, Connecticut, U.S.A.
| | - Omar Guerrero
- A.T. Still University School of Osteopathic Medicine in Arizona, Mesa, Arizona, U.S.A
| | - Avanish Yendluri
- Icahn School of Medicine at Mount Sinai, New York, New York, U.S.A
| | - Bradford O Parsons
- Department of Orthopedic Surgery, Mount Sinai, New York, New York, U.S.A
| | - Leesa M Galatz
- Department of Orthopedic Surgery, Mount Sinai, New York, New York, U.S.A
| | - Xinning Li
- Department of Orthopedic Surgery, Boston University School of Medicine, Boston, Massachusetts, U.S.A
| | - John D Kelly
- Department of Orthopedic Surgery, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A
| | | |
Collapse
|
23
|
Bradshaw TJ, Tie X, Warner J, Hu J, Li Q, Li X. Large Language Models and Large Multimodal Models in Medical Imaging: A Primer for Physicians. J Nucl Med 2025; 66:173-182. [PMID: 39819692 DOI: 10.2967/jnumed.124.268072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2024] [Accepted: 12/19/2024] [Indexed: 01/19/2025] Open
Abstract
Large language models (LLMs) are poised to have a disruptive impact on health care. Numerous studies have demonstrated promising applications of LLMs in medical imaging, and this number will grow as LLMs further evolve into large multimodal models (LMMs) capable of processing both text and images. Given the substantial roles that LLMs and LMMs will have in health care, it is important for physicians to understand the underlying principles of these technologies so they can use them more effectively and responsibly and help guide their development. This article explains the key concepts behind the development and application of LLMs, including token embeddings, transformer networks, self-supervised pretraining, fine-tuning, and others. It also describes the technical process of creating LMMs and discusses use cases for both LLMs and LMMs in medical imaging.
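As a concrete illustration of two concepts this primer explains, token embeddings and self-attention, here is a toy NumPy sketch; the vocabulary, dimensions, and random weights are arbitrary stand-ins for what a trained LLM would learn, not the article's material.

```python
# A toy sketch (illustrative only) of two building blocks of an LLM:
# a token-embedding lookup and single-head scaled dot-product self-attention.
# Vocabulary, dimensions, and weights are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"chest": 0, "x-ray": 1, "shows": 2, "effusion": 3}
d_model = 8

embedding_table = rng.normal(size=(len(vocab), d_model))
tokens = [vocab[w] for w in ["chest", "x-ray", "shows", "effusion"]]
x = embedding_table[tokens]                      # (seq_len, d_model)

# Single-head self-attention.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)              # pairwise token affinities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
contextual = weights @ v                         # each token mixes in context
print(contextual.shape)                          # (4, 8)
```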
Collapse
Affiliation(s)
- Tyler J Bradshaw
- Department of Radiology, University of Wisconsin-Madison, Madison, Wisconsin;
| | - Xin Tie
- Department of Radiology, University of Wisconsin-Madison, Madison, Wisconsin
| | - Joshua Warner
- Department of Radiology, University of Wisconsin-Madison, Madison, Wisconsin
| | - Junjie Hu
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin; and
| | - Quanzheng Li
- Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts
| | - Xiang Li
- Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts
| |
Collapse
|
24
|
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/26/2024] [Indexed: 02/06/2025] Open
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Collapse
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
| | - Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
| | | | | | - Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
25
|
Aziz AAA, Abdelrahman HH, Hassan MG. The use of ChatGPT and Google Gemini in responding to orthognathic surgery-related questions: A comparative study. J World Fed Orthod 2025; 14:20-26. [PMID: 39490358 DOI: 10.1016/j.ejwf.2024.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 09/05/2024] [Accepted: 09/05/2024] [Indexed: 11/05/2024]
Abstract
AIM This study employed a quantitative approach to compare the reliability of responses provided by ChatGPT-3.5, ChatGPT-4, and Google Gemini in response to orthognathic surgery-related questions. MATERIAL AND METHODS The authors adapted a set of 64 questions encompassing all of the domains and aspects related to orthognathic surgery. One author submitted the questions to ChatGPT-3.5, ChatGPT-4, and Google Gemini. The AI-generated responses from the three platforms were recorded and evaluated by 2 blinded and independent experts. The reliability of the AI-generated responses was evaluated using a tool for accuracy of information and completeness. In addition, the provision of definitive answers to close-ended questions, reference citations, graphical elements, and advice to schedule consultations with a specialist was recorded. RESULTS Although ChatGPT-3.5 achieved the highest information reliability score, the 3 LLMs showed similar reliability scores in providing responses to orthognathic surgery-related inquiries. Moreover, Google Gemini more often included physician recommendations and provided graphical elements. Both ChatGPT-3.5 and -4 lacked these features. CONCLUSION This study shows that ChatGPT-3.5, ChatGPT-4, and Google Gemini can provide reliable responses to inquiries about orthognathic surgery. However, Google Gemini stood out by incorporating additional references and illustrations within its responses. These findings highlight the need for additional evaluation of AI capabilities across different healthcare domains.
Collapse
Affiliation(s)
- Ahmed A Abdel Aziz
- Department of Orthodontics, Faculty of Dentistry, Assiut University, Assiut, Egypt
| | - Hams H Abdelrahman
- Department of Pediatric Dentistry and Dental Public Health, Faculty of Dentistry, Alexandria University, Alexandria, Egypt
| | - Mohamed G Hassan
- Department of Orthodontics, Faculty of Dentistry, Assiut University, Assiut, Egypt; Division of Bone and Mineral Diseases, Department of Medicine, School of Medicine, Washington University in St. Louis, St. Louis, Missouri.
| |
Collapse
|
26
|
Chervonski E, Harish KB, Rockman CB, Sadek M, Teter KA, Jacobowitz GR, Berland TL, Lohr J, Moore C, Maldonado TS. Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients. Vascular 2025; 33:229-237. [PMID: 38500300 DOI: 10.1177/17085381241240550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
OBJECTIVES Generative artificial intelligence (AI) has emerged as a promising tool to engage with patients. The objective of this study was to assess the quality of AI responses to common patient questions regarding vascular surgery disease processes. METHODS OpenAI's ChatGPT-3.5 and Google Bard were queried with 24 mock patient questions spanning seven vascular surgery disease domains. Six experienced vascular surgery faculty at a tertiary academic center independently graded AI responses on their accuracy (rated 1-4 from completely inaccurate to completely accurate), completeness (rated 1-4 from totally incomplete to totally complete), and appropriateness (binary). Responses were also evaluated with three readability scales. RESULTS ChatGPT responses were rated, on average, more accurate than Bard responses (3.08 ± 0.33 vs 2.82 ± 0.40, p < .01). ChatGPT responses were scored, on average, more complete than Bard responses (2.98 ± 0.34 vs 2.62 ± 0.36, p < .01). Most ChatGPT responses (75.0%, n = 18) and almost half of Bard responses (45.8%, n = 11) were unanimously deemed appropriate. Almost one-third of Bard responses (29.2%, n = 7) were deemed inappropriate by at least two reviewers, and two Bard responses (8.4%) were considered inappropriate by the majority. The mean Flesch Reading Ease, Flesch-Kincaid Grade Level, and Gunning Fog Index of ChatGPT responses were 29.4 ± 10.8, 14.5 ± 2.2, and 17.7 ± 3.1, respectively, indicating that responses were readable with a post-secondary education. Bard's mean readability scores were 58.9 ± 10.5, 8.2 ± 1.7, and 11.0 ± 2.0, respectively, indicating that responses were readable with a high-school education (p < .0001 for all three metrics). ChatGPT's mean response length (332 ± 79 words) was higher than Bard's mean response length (183 ± 53 words, p < .001). There was no difference in the accuracy, completeness, readability, or response length of ChatGPT or Bard between disease domains (p > .05 for all analyses). CONCLUSIONS AI offers a novel means of educating patients that avoids the inundation of information from "Dr Google" and the time barriers of physician-patient encounters. ChatGPT provides largely valid, though imperfect, responses to myriad patient questions at the expense of readability. While Bard responses are more readable and concise, their quality is poorer. Further research is warranted to better understand failure points for large language models in vascular surgery patient education.
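The three readability indices reported above follow published formulas; the sketch below implements them with a deliberately crude vowel-group syllable counter (an assumption made for brevity), so its numbers will differ slightly from validated readability tools.

```python
# A minimal sketch of the three readability formulas cited above.
# The syllable counter is a rough approximation, not a validated tool.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps, spw = n_words / sentences, syllables / n_words
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "gunning_fog": 0.4 * (wps + 100 * complex_words / n_words),
    }

print(readability("Carotid stenosis narrows the artery. Surgery may reduce stroke risk."))
```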
Collapse
Affiliation(s)
- Ethan Chervonski
- New York University Grossman School of Medicine, New York, NY, USA
| | - Keerthi B Harish
- New York University Grossman School of Medicine, New York, NY, USA
| | - Caron B Rockman
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Mikel Sadek
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Katherine A Teter
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Glenn R Jacobowitz
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Todd L Berland
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| | - Joann Lohr
- Dorn Veterans Affairs Medical Center, Columbia, SC, USA
| | | | - Thomas S Maldonado
- Division of Vascular & Endovascular Surgery, Department of Surgery, New York University Langone Health, New York, NY, USA
| |
Collapse
|
27
|
Gondode PG, Singh R, Mehta S, Singh S, Kumar S, Nayak SS. Artificial intelligence chatbots versus traditional medical resources for patient education on "Labor Epidurals": an evaluation of accuracy, emotional tone, and readability. Int J Obstet Anesth 2025; 61:104302. [PMID: 39657284 DOI: 10.1016/j.ijoa.2024.104302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/12/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 12/12/2024]
Abstract
BACKGROUND Labor epidural analgesia is a widely used method for pain relief in childbirth, yet information accessibility for expectant mothers remains a challenge. Artificial intelligence (AI) chatbots like Chat Generative Pre-Trained Transformer (ChatGPT) and Google Gemini offer potential solutions for improving patient education. This study evaluates the accuracy, readability, and emotional tone of AI chatbot responses compared to the American Society of Anesthesiologists (ASA) online materials on labor epidurals. METHODS Eight common questions about labor epidurals were posed to ChatGPT and Gemini. Seven obstetric anaesthesiologists evaluated the generated responses for accuracy and completeness on a 1-10 Likert scale, comparing them with ASA-sourced content. Statistical analysis (one-way ANOVA, Tukey HSD), sentiment analysis, and readability metrics (Flesch Reading Ease) were used to assess differences. RESULTS ASA materials scored highest for accuracy (8.80 ± 0.40) and readability, followed by Gemini and ChatGPT. Completeness scores showed ASA and Gemini performing significantly better than ChatGPT (P < 0.001). ASA materials were the most accessible, while Gemini content was more complex. Sentiment analysis indicated a neutral tone for ASA and Gemini, with ChatGPT displaying a less consistent tone. CONCLUSION AI chatbots exhibit promise in patient education for labor epidurals but require improvements in readability and tone consistency to enhance engagement. Further refinement of AI chatbots may support more accessible, patient-centred healthcare information.
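Emotional tone can be scored in several ways; the sketch below uses NLTK's VADER analyzer as one commonly available option (an assumption, since the abstract does not state which sentiment tool was used) to assign a compound polarity score to sample answers.

```python
# A minimal sketch (assumed approach, not the authors' pipeline) of scoring
# the emotional tone of patient-education text with NLTK's VADER analyzer.
# Requires the vader_lexicon resource to be downloaded once.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

answers = {
    "chatbot": "An epidural is a safe, effective way to relieve labor pain.",
    "reference": "Labor epidural analgesia provides pain relief during childbirth.",
}
for source, text in answers.items():
    scores = sia.polarity_scores(text)   # neg/neu/pos plus compound in [-1, 1]
    print(source, scores["compound"])
```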
Collapse
Affiliation(s)
- Prakash Gyandev Gondode
- Department of Anaesthesiology Pain Medicine and Critical Care, All India Institute of Medical Sciences, New Delhi, India.
| | - Ram Singh
- Department of Anaesthesiology Pain Medicine and Critical Care, All India Institute of Medical Sciences, New Delhi, India.
| | - Swati Mehta
- Department of Anaesthesiology Pain Medicine and Critical Care, All India Institute of Medical Sciences, New Delhi, India.
| | - Sneha Singh
- Department of Anaesthesiology Pain Medicine and Critical Care, All India Institute of Medical Sciences, New Delhi, India.
| | - Subodh Kumar
- Department of Anaesthesiology Pain Medicine and Critical Care, All India Institute of Medical Sciences, New Delhi, India.
| | - Sudhansu Sekhar Nayak
- Department of Anaesthesiology Pain Medicine and Critical Care, All India Institute of Medical Sciences, New Delhi, India.
| |
Collapse
|
28
|
Saw SN, Yan YY, Ng KH. Current status and future directions of explainable artificial intelligence in medical imaging. Eur J Radiol 2025; 183:111884. [PMID: 39667118 DOI: 10.1016/j.ejrad.2024.111884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 11/18/2024] [Accepted: 12/05/2024] [Indexed: 12/14/2024]
Abstract
The inherent "black box" nature of AI algorithms presents a substantial barrier to the widespread adoption of the technology in clinical settings, leading to a lack of trust among users. This review begins by examining the foundational stages involved in the interpretation of medical images by radiologists and clinicians, encompassing both type 1 (fast thinking - ability of the brain to think and act intuitively) and type 2 (slow analytical - slow analytical, laborious approach to decision-making) decision-making processes. The discussion then delves into current Explainable AI (XAI) approaches, exploring both inherent and post-hoc explainability for medical imaging applications and highlighting the milestones achieved. XAI in medicine refers to AI system designed to provide transparent, interpretable, and understandable reasoning behind AI predictions or decisions. Additionally, the paper showcases some commercial AI medical systems that offer explanations through features such as heatmaps. Opportunities, challenges and potential avenues for advancing the field are also addressed. In conclusion, the review observes that state-of-the-art XAI methods are not mature enough for implementation, as the explanations they provide are challenging for medical experts to comprehend. Deeper understanding of the cognitive mechanisms by medical professionals is important in aiming to develop more interpretable XAI methods.
Collapse
Affiliation(s)
- Shier Nee Saw
- Department of Artificial Intelligence, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur 50603, Malaysia.
| | - Yet Yen Yan
- Department of Radiology, Changi General Hospital, Singapore; Radiological Sciences ACP, Duke-NUS Medical School, Singapore; Present Address: Department of Diagnostic Radiology, Mount Elizabeth Hospital, 3 Mount Elizabeth, Singapore 228510, Republic of Singapore
| | - Kwan Hoong Ng
- Department of Biomedical Imaging, Faculty of Medicine, Universiti Malaya, Kuala Lumpur 50603, Malaysia; Faculty of Medicine and Health Sciences, UCSI University, Port Dickson, Negeri Sembilan, Malaysia
| |
Collapse
|
29
|
Cohen ND, Ho M, McIntire D, Smith K, Kho KA. A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis. AJOG GLOBAL REPORTS 2025; 5:100405. [PMID: 39810943 PMCID: PMC11730533 DOI: 10.1016/j.xagr.2024.100405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2025] Open
Abstract
Introduction The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them. Objective This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them. Study Design Three LLMs, ChatGPT-4 (OpenAI), Claude (Anthropic), and Bard (Google), were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged across the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the reviewers' strength of agreement in ranking the LLMs' responses for each item. Results Average scores for the 10 answers among Bard, ChatGPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence. Conclusion The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.
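Kendall's W, used above to gauge agreement among the nine reviewers, has a simple closed form; the sketch below implements it without a tie correction and uses invented ranks purely to show the computation.

```python
# A minimal sketch (no tie correction) of Kendall's W for agreement among
# reviewers ranking the three chatbots on one question; ranks are illustrative.
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """ranks: (m raters, n items), each row a 1..n ranking."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical ranks from 4 reviewers over 3 chatbots (1 = best).
ranks = np.array([
    [1, 2, 3],
    [1, 3, 2],
    [1, 2, 3],
    [2, 1, 3],
])
print(f"Kendall's W = {kendalls_w(ranks):.2f}")   # 1.0 = perfect agreement
```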
Collapse
Affiliation(s)
- Natalie D. Cohen
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Milan Ho
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Donald McIntire
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Katherine Smith
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| | - Kimberly A. Kho
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
| |
Collapse
|
30
|
Sabri H, Saleh MHA, Hazrati P, Merchant K, Misch J, Kumar PS, Wang H, Barootchi S. Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education. J Periodontal Res 2025; 60:121-133. [PMID: 39030766 PMCID: PMC11873669 DOI: 10.1111/jre.13323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 06/24/2024] [Accepted: 06/25/2024] [Indexed: 07/22/2024]
Abstract
INTRODUCTION The emerging rise in novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify and compare the accuracy of responses provided by ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, the three primary large language models (LLMs), to human graduate students (control group) to the annual in-service examination questions posed by the American Academy of Periodontology (AAP). METHODS Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual in-service examination of AAP administered between 2020 and 2023 were presented to the LLMs. Their responses were analyzed using chi-square tests, and the performance was juxtaposed to the scores of periodontal residents from corresponding years, as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam; and in answering the most difficult questions. RESULTS ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001). This chatbot showed an accuracy range between 78.80% and 80.98% across the various exam years. Gemini consistently recorded superior performance with scores of 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) for the exams from 2020 to 2023 compared to ChatGPT-3.5, which achieved 62.5%, 68.24%, 69.83%, and 59.27% respectively. Google Gemini (72.86%) surpassed the average scores achieved by first- (63.48% ± 31.67) and second-year residents (66.25% ± 31.61) when all exam years combined. However, it could not surpass that of third-year residents (69.06% ± 30.45). CONCLUSIONS Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability while Gemini and ChatGPT-3.5 showed a weaker performance. These findings underscore the potential of deploying LLMs as an educational tool in periodontics and oral implantology domains. However, the current limitations of these models such as inability to effectively process image-based inquiries, the propensity for generating inconsistent responses to the same prompts, and achieving high (80% by GPT-4) but not absolute accuracy rates should be considered. An objective comparison of their capability versus their capacity is required to further develop this field of study.
Collapse
Affiliation(s)
- Hamoun Sabri
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Center for Clinical Research and Evidence Synthesis in Oral Tissue Regeneration (CRITERION), Ann Arbor, Michigan, USA
| | - Muhammad H. A. Saleh
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
| | - Parham Hazrati
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
| | | | - Jonathan Misch
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Private Practice, Ann Arbor, Michigan, USA
| | - Purnima S. Kumar
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
| | - Hom‐Lay Wang
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
| | - Shayan Barootchi
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Center for Clinical Research and Evidence Synthesis in Oral Tissue Regeneration (CRITERION), Ann Arbor, Michigan, USA
- Division of Periodontology, Department of Oral Medicine, Infection, and Immunity, Harvard School of Dental Medicine, Boston, Massachusetts, USA
| |
Collapse
|
31
|
Aldukhail S. Mapping the Landscape of Generative Language Models in Dental Education: A Comparison Between ChatGPT and Google Bard. EUROPEAN JOURNAL OF DENTAL EDUCATION : OFFICIAL JOURNAL OF THE ASSOCIATION FOR DENTAL EDUCATION IN EUROPE 2025; 29:136-148. [PMID: 39563479 DOI: 10.1111/eje.13056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/15/2023] [Revised: 08/29/2024] [Accepted: 10/28/2024] [Indexed: 11/21/2024]
Abstract
Generative language models (LLMs) have shown great potential in various fields, including medicine and education. This study evaluated and compared ChatGPT 3.5 and Google Bard within dental education and research. METHODS We developed seven dental education-related queries to assess each model across various domains: their role in dental education, creation of specific exercises, simulations of dental problems with treatment options, development of assessment tools, proficiency in dental literature and their ability to identify, summarise and critique a specific article. Two blind reviewers scored the responses using defined metrics. The means and standard deviations of the scores were reported, and differences between the scores were analysed using Wilcoxon tests. RESULTS ChatGPT 3.5 outperformed Bard in several tasks, including the ability to create highly comprehensive, accurate, clear, relevant and specific exercises on dental concepts, generate simulations of dental problems with treatment options and develop assessment tools. On the other hand, Bard was successful in retrieving real research, and it was able to critique the article it selected. Statistically significant differences were noted between the average scores of the two models (p ≤ 0.05) for domains 1 and 3. CONCLUSION This study highlights the potential of LLMs as dental education tools, enhancing learning through virtual simulations and critical performance analysis. However, the variability in LLMs' performance underscores the need for targeted training, particularly in evidence-based content generation. It is crucial for educators, students and practitioners to exercise caution when considering the delegation of critical educational or healthcare decisions to computer systems.
Collapse
Affiliation(s)
- Shaikha Aldukhail
- Department of Preventive dental sciences, college of dentistry, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
| |
Collapse
|
32
|
Andrikyan W, Sametinger SM, Kosfeld F, Jung-Poppe L, Fromm MF, Maas R, Nicolaus HF. Artificial intelligence-powered chatbots in search engines: a cross-sectional study on the quality and risks of drug information for patients. BMJ Qual Saf 2025; 34:100-109. [PMID: 39353736 PMCID: PMC11874309 DOI: 10.1136/bmjqs-2024-017476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2024] [Accepted: 08/22/2024] [Indexed: 10/04/2024]
Abstract
BACKGROUND Search engines often serve as a primary resource for patients to obtain drug information. However, the search engine market is rapidly changing due to the introduction of artificial intelligence (AI)-powered chatbots. The consequences for medication safety when patients interact with chatbots remain largely unexplored. OBJECTIVE To explore the quality and potential safety concerns of answers provided by an AI-powered chatbot integrated within a search engine. METHODOLOGY Bing copilot was queried on 10 frequently asked patient questions regarding the 50 most prescribed drugs in the US outpatient market. Patient questions covered drug indications, mechanisms of action, instructions for use, adverse drug reactions and contraindications. Readability of chatbot answers was assessed using the Flesch Reading Ease Score. Completeness and accuracy were evaluated based on corresponding patient drug information in the pharmaceutical encyclopaedia drugs.com. On a preselected subset of inaccurate chatbot answers, healthcare professionals evaluated likelihood and extent of possible harm if patients follow the chatbot's given recommendations. RESULTS Of 500 generated chatbot answers, overall readability implied that responses were difficult to read according to the Flesch Reading Ease Score. Overall median completeness and accuracy of chatbot answers were 100.0% (IQR 50.0-100.0%) and 100.0% (IQR 88.1-100.0%), respectively. Of the subset of 20 chatbot answers, experts found 66% (95% CI 50% to 85%) to be potentially harmful. 42% (95% CI 25% to 60%) of these 20 chatbot answers were found to potentially cause moderate to mild harm, and 22% (95% CI 10% to 40%) to cause severe harm or even death if patients follow the chatbot's advice. CONCLUSIONS AI-powered chatbots are capable of providing overall complete and accurate patient drug information. Yet, experts deemed a considerable number of answers incorrect or potentially harmful. Furthermore, complexity of chatbot answers may limit patient understanding. Hence, healthcare professionals should be cautious in recommending AI-powered search engines until more precise and reliable alternatives are available.
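The medians with IQRs and the expert-rated harm proportions with 95% CIs reported above can be computed as in the sketch below; the scores and counts are hypothetical, and the Wilson interval from statsmodels is an assumed (not stated) choice of interval method.

```python
# A minimal sketch (assumed analysis, not the authors' code) of summarizing
# chatbot-answer scores with median/IQR and putting a Wilson 95% CI around
# an expert-rated proportion of potentially harmful answers.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

completeness = np.array([100, 100, 50, 100, 75, 100, 100, 62.5, 100, 100])
q1, med, q3 = np.percentile(completeness, [25, 50, 75])
print(f"completeness: median {med:.1f}% (IQR {q1:.1f}-{q3:.1f}%)")

harmful, n = 13, 20          # hypothetical: 13 of 20 reviewed answers flagged
low, high = proportion_confint(harmful, n, alpha=0.05, method="wilson")
print(f"harmful: {harmful / n:.0%} (95% CI {low:.0%} to {high:.0%})")
```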
Collapse
Affiliation(s)
- Wahram Andrikyan
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Sophie Marie Sametinger
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Frithjof Kosfeld
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- GSK, Wavre, Belgium
| | - Lea Jung-Poppe
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Universitätsklinikum Erlangen, Erlangen, Germany
| | - Martin F Fromm
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- FAU NeW-Research Center New Bioactive Compounds, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Renke Maas
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- FAU NeW-Research Center New Bioactive Compounds, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Hagen F Nicolaus
- Institute of Experimental and Clinical Pharmacology and Toxicology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Universitätsklinikum Erlangen, Erlangen, Germany
| |
Collapse
|
33
|
Koyun M, Taskent I. Evaluation of Advanced Artificial Intelligence Algorithms' Diagnostic Efficacy in Acute Ischemic Stroke: A Comparative Analysis of ChatGPT-4o and Claude 3.5 Sonnet Models. J Clin Med 2025; 14:571. [PMID: 39860577 PMCID: PMC11765597 DOI: 10.3390/jcm14020571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2024] [Revised: 01/15/2025] [Accepted: 01/16/2025] [Indexed: 01/27/2025] Open
Abstract
Background/Objectives: Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with early and accurate diagnosis being critical for timely intervention and improved patient outcomes. This retrospective study aimed to assess the diagnostic performance of two advanced artificial intelligence (AI) models, Chat Generative Pre-trained Transformer (ChatGPT-4o) and Claude 3.5 Sonnet, in identifying AIS from diffusion-weighted imaging (DWI). Methods: The DWI images of a total of 110 cases (AIS group: n = 55, healthy controls: n = 55) were provided to the AI models via standardized prompts. The models' responses were compared to radiologists' gold-standard evaluations, and performance metrics such as sensitivity, specificity, and diagnostic accuracy were calculated. Results: Both models exhibited a high sensitivity for AIS detection (ChatGPT-4o: 100%, Claude 3.5 Sonnet: 94.5%). However, ChatGPT-4o demonstrated a significantly lower specificity (3.6%) compared to Claude 3.5 Sonnet (74.5%). The agreement with radiologists was poor for ChatGPT-4o (κ = 0.036; 95% CI: -0.013, 0.085) but good for Claude 3.5 Sonnet (κ = 0.691; 95% CI: 0.558, 0.824). In terms of AIS hemispheric localization accuracy, Claude 3.5 Sonnet (67.2%) outperformed ChatGPT-4o (32.7%). Similarly, for specific AIS localization, Claude 3.5 Sonnet (30.9%) showed greater accuracy than ChatGPT-4o (7.3%), with these differences being statistically significant (p < 0.05). Conclusions: This study highlights the superior diagnostic performance of Claude 3.5 Sonnet compared to ChatGPT-4o in identifying AIS from DWI. Despite these advantages, both models demonstrated notable limitations in accuracy, emphasizing the need for further development before achieving full clinical applicability. These findings underline the potential of AI tools in radiological diagnostics while acknowledging their current limitations.
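Sensitivity, specificity, PPV, NPV, and accuracy all derive from a 2x2 confusion table; the sketch below shows the arithmetic with hypothetical counts consistent with a 55-case/55-control design, not the study's actual tallies.

```python
# A minimal sketch of the diagnostic metrics reported, computed from a 2x2
# confusion table (counts here are hypothetical, not the study data).
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Example: 55 AIS cases and 55 controls, with a model that produces 52 true
# positives and 41 true negatives.
print(diagnostic_metrics(tp=52, fp=14, fn=3, tn=41))
```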
Collapse
Affiliation(s)
- Mustafa Koyun
- Department of Radiology, Kastamonu Training and Research Hospital, Kastamonu 37150, Turkey
| | - Ismail Taskent
- Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey;
| |
Collapse
|
34
|
Koyun M, Cevval ZK, Reis B, Ece B. Detection of Intracranial Hemorrhage from Computed Tomography Images: Diagnostic Role and Efficacy of ChatGPT-4o. Diagnostics (Basel) 2025; 15:143. [PMID: 39857027 PMCID: PMC11763562 DOI: 10.3390/diagnostics15020143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2024] [Revised: 01/07/2025] [Accepted: 01/08/2025] [Indexed: 01/27/2025] Open
Abstract
Background/Objectives: The role of artificial intelligence (AI) in radiological image analysis is rapidly evolving. This study evaluates the diagnostic performance of Chat Generative Pre-trained Transformer Omni (GPT-4 Omni) in detecting intracranial hemorrhages (ICHs) in non-contrast computed tomography (NCCT) images, along with its ability to classify hemorrhage type, stage, anatomical location, and associated findings. Methods: A retrospective study was conducted using 240 cases, comprising 120 ICH cases and 120 controls with normal findings. Five consecutive NCCT slices per case were selected by radiologists and analyzed by ChatGPT-4o using a standardized prompt with nine questions. Diagnostic accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated by comparing the model's results with radiologists' assessments (the gold standard). After a two-week interval, the same dataset was re-evaluated to assess intra-observer reliability and consistency. Results: ChatGPT-4o achieved 100% accuracy in identifying imaging modality type. For ICH detection, the model demonstrated a diagnostic accuracy of 68.3%, sensitivity of 79.2%, specificity of 57.5%, PPV of 65.1%, and NPV of 73.4%. It correctly classified 34.0% of hemorrhage types and 7.3% of localizations. All ICH-positive cases were identified as acute phase (100%). In the second evaluation, diagnostic accuracy improved to 73.3%, with a sensitivity of 86.7% and a specificity of 60%. The Cohen's Kappa coefficient for intra-observer agreement in ICH detection indicated moderate agreement (κ = 0.469). Conclusions: ChatGPT-4o shows promise in identifying imaging modalities and ICH presence but demonstrates limitations in localization and hemorrhage type classification. These findings highlight its potential for improvement through targeted training for medical applications.
Collapse
Affiliation(s)
- Mustafa Koyun
- Department of Radiology, Kastamonu Training and Research Hospital, Kastamonu 37150, Turkey;
| | - Zeycan Kubra Cevval
- Department of Radiology, Kastamonu Training and Research Hospital, Kastamonu 37150, Turkey;
| | - Bahadir Reis
- Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey; (B.R.); (B.E.)
| | - Bunyamin Ece
- Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey; (B.R.); (B.E.)
| |
Collapse
|
35
|
Derbal Y. Adaptive Treatment of Metastatic Prostate Cancer Using Generative Artificial Intelligence. Clin Med Insights Oncol 2025; 19:11795549241311408. [PMID: 39776668 PMCID: PMC11701910 DOI: 10.1177/11795549241311408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2024] [Accepted: 12/12/2024] [Indexed: 01/11/2025] Open
Abstract
Despite the expanding therapeutic options available to cancer patients, therapeutic resistance, disease recurrence, and metastasis persist as hallmark challenges in the treatment of cancer. The rise to prominence of generative artificial intelligence (GenAI) in many realms of human activity is prompting consideration of its capabilities as a potential lever to advance the development of effective cancer treatments. This article presents a hypothetical case study on the application of generative pre-trained transformers (GPTs) to the treatment of metastatic prostate cancer (mPC). The case explores the design of GPT-supported adaptive intermittent therapy for mPC. Testosterone and prostate-specific antigen (PSA) are assumed to be repeatedly monitored while treatment may involve a combination of androgen deprivation therapy (ADT), androgen receptor-signalling inhibitors (ARSI), chemotherapy, and radiotherapy. The analysis covers various questions relevant to the configuration, training, and inferencing of GPTs for the case of mPC treatment, with particular attention to risk mitigation for the hallucination problem and its implications for the clinical integration of GenAI technologies. The case study provides elements of an actionable pathway to the realization of GenAI-assisted adaptive treatment of metastatic prostate cancer. As such, the study is expected to help facilitate the design of clinical trials of GenAI-supported cancer treatments.
Collapse
Affiliation(s)
- Youcef Derbal
- Ted Rogers School of Information Technology Management, Toronto Metropolitan University, Toronto, ON, Canada
| |
Collapse
|
36
|
Taymour N, Fouda SM, Abdelrahaman HH, Hassan MG. Performance of the ChatGPT-3.5, ChatGPT-4, and Google Gemini large language models in responding to dental implantology inquiries. J Prosthet Dent 2025:S0022-3913(24)00833-3. [PMID: 39757053 DOI: 10.1016/j.prosdent.2024.12.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2024] [Revised: 12/06/2024] [Accepted: 12/10/2024] [Indexed: 01/07/2025]
Abstract
STATEMENT OF PROBLEM Artificial intelligence (AI) chatbots have been proposed as promising resources for oral health information. However, the quality and readability of existing online health-related information are often inconsistent and challenging. PURPOSE This study aimed to compare the reliability and usefulness of dental implantology-related information provided by the ChatGPT-3.5, ChatGPT-4, and Google Gemini large language models (LLMs). MATERIAL AND METHODS A total of 75 questions were developed covering various dental implant domains. These questions were then presented to 3 different LLMs: ChatGPT-3.5, ChatGPT-4, and Google Gemini. The responses generated were recorded and independently assessed by 2 specialists who were blinded to the source of the responses. The evaluation focused on the accuracy of the generated answers using a modified 5-point Likert scale to measure the reliability and usefulness of the information provided. Additionally, the ability of the AI chatbots to offer definitive responses to closed questions, provide reference citations, and advise scheduling consultations with a dental specialist was also analyzed. The Friedman, Mann-Whitney U, and Spearman correlation tests were used for data analysis (α=.05). RESULTS Google Gemini exhibited higher reliability and usefulness scores compared with ChatGPT-3.5 and ChatGPT-4 (P<.001). Google Gemini also demonstrated superior proficiency in identifying closed questions (25 questions, 41%) and recommended specialist consultations for 74 questions (98.7%), significantly outperforming ChatGPT-4 (30 questions, 40.0%) and ChatGPT-3.5 (28 questions, 37.3%) (P<.001). A positive correlation was found between reliability and usefulness scores, with Google Gemini showing the strongest correlation (ρ=.702). CONCLUSIONS The 3 AI chatbots showed acceptable levels of reliability and usefulness in addressing dental implant-related queries. Google Gemini distinguished itself by providing responses consistent with specialist consultations.
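The statistical workflow described here (a Friedman test across the three chatbots rated on the same questions, plus a Spearman correlation between reliability and usefulness) can be reproduced along the following lines; the scores below are fabricated for illustration, and scipy is assumed to be available.

```python
# Hypothetical data, not the study's: Friedman test across three chatbots rated on the
# same 75 questions, plus Spearman correlation between reliability and usefulness scores.
import numpy as np
from scipy.stats import friedmanchisquare, spearmanr

rng = np.random.default_rng(0)
n_questions = 75
gemini = rng.integers(3, 6, n_questions)   # made-up 1-5 Likert reliability ratings
gpt4   = rng.integers(2, 6, n_questions)
gpt35  = rng.integers(2, 5, n_questions)

stat, p = friedmanchisquare(gemini, gpt4, gpt35)  # related samples: the same questions
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

usefulness = np.clip(gemini + rng.integers(-1, 2, n_questions), 1, 5)
rho, p_rho = spearmanr(gemini, usefulness)        # analogue of the reported rho = .702
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.4f}")
```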
Collapse
Affiliation(s)
- Noha Taymour
- Lecturer, Department of Substitutive Dental Sciences, College of Dentistry, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia.
| | - Shaimaa M Fouda
- Lecturer, Department of Substitutive Dental Sciences, College of Dentistry, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
| | - Hams H Abdelrahaman
- Assistant Lecturer, Department of Pediatric Dentistry, and Dental Public Health, Faculty of Dentistry, Alexandria University, Alexandria, Egypt
| | - Mohamed G Hassan
- Postdoctoral Research Associate, Division of Bone and Mineral Diseases, Department of Internal Medicine, School of Medicine, Washington University in St. Louis, St. Louis, MO; and Lecturer, Department of Orthodontics, Faculty of Dentistry, Assiut University, Assiut, Egypt
| |
Collapse
|
37
|
Tarris G, Martin L. Performance assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to problem solving in pathology in French language. Digit Health 2025; 11:20552076241310630. [PMID: 39896270 PMCID: PMC11786284 DOI: 10.1177/20552076241310630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2024] [Accepted: 12/09/2024] [Indexed: 02/04/2025] Open
Abstract
Digital teaching diversifies the ways knowledge can be assessed, as natural language processing offers the possibility of answering questions posed by students and teachers. Objective This study evaluated the performance of ChatGPT, Bard and Gemini on second-year medical studies (DFGSM2) Pathology exams from the Health Sciences Center of Dijon (France) from 2018 to 2022. Methods From 2018 to 2022, exam scores, discriminating powers and discordance rates were retrieved. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT-4. Chatbots' and students' average scores were compared, as well as the discriminating powers of the questions answered by the chatbots. The percentage of student-chatbot identical answers was retrieved, and linear regression analysis correlated chatbot scores with student discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa. Results Newer chatbots outperformed both students and older chatbots in overall scores and on multiple-response questions. All chatbots outperformed students on less discriminating questions. Conversely, all chatbots were outperformed by students on questions with a high discriminating power. Chatbot scores were correlated with student discordance rates. ChatGPT 4 and Gemini 1.5 provided variable answers, due to effects linked to prompt engineering. Conclusion Our study, in line with the literature, confirms the chatbots' moderate performance on questions requiring complex reasoning, with ChatGPT outperforming the Google chatbots. The use of NLP software based on distributional semantics remains a challenge for generating questions in French. Drawbacks of using NLP software to generate questions include hallucinations and erroneous medical knowledge, which have to be taken into account when using NLP software in medical education.
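The round-to-round reliability analysis described above can be approximated with Fleiss' kappa by treating each submission round as a rater of each question. The sketch below uses statsmodels and invented answer data, so the numbers are assumptions rather than the study's results.

```python
# Hedged sketch with invented answers: Fleiss' kappa for a chatbot's stability across
# four submission rounds (rows = 70 questions, columns = 4 rounds, values = chosen option).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(1)
stable = rng.integers(0, 5, size=(70, 1)).repeat(4, axis=1)    # perfectly consistent answers
noise = rng.integers(0, 5, size=(70, 4))                       # random alternative answers
answers = np.where(rng.random((70, 4)) < 0.85, stable, noise)  # ~85% of rounds consistent

table, _ = aggregate_raters(answers)          # per-question counts of each answer option
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```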
Collapse
Affiliation(s)
- Georges Tarris
- Department of Pathology, University Hospital François Mitterrand of Dijon–Bourgogne, Dijon, France
- University of Burgundy Health Sciences Center, Dijon, France
| | - Laurent Martin
- Department of Pathology, University Hospital François Mitterrand of Dijon–Bourgogne, Dijon, France
- University of Burgundy Health Sciences Center, Dijon, France
| |
Collapse
|
38
|
Gosak L, Štiglic G, Pruinelli L, Vrbnjak D. PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation. J Nurs Scholarsh 2025; 57:5-16. [PMID: 39582233 PMCID: PMC11771709 DOI: 10.1111/jnu.13036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 10/24/2024] [Accepted: 11/07/2024] [Indexed: 11/26/2024]
Abstract
AIM The aim of this study was to evaluate and compare artificial intelligence (AI)-based large language models (LLMs) (ChatGPT-3.5, Bing, and Bard) with human-based formulations in generating relevant clinical queries, using comprehensive methodological evaluations. METHODS To interact with the major LLMs ChatGPT-3.5, Bing Chat, and Google Bard, scripts and prompts were designed to formulate PICOT (population, intervention, comparison, outcome, time) clinical questions and search strategies. Quality of the LLMs responses was assessed using a descriptive approach and independent assessment by two researchers. To determine the number of hits, PubMed, Web of Science, Cochrane Library, and CINAHL Ultimate search results were imported separately, without search restrictions, with the search strings generated by the three LLMs and an additional one by the expert. Hits from one of the scenarios were also exported for relevance evaluation. The use of a single scenario was chosen to provide a focused analysis. Cronbach's alpha and intraclass correlation coefficient (ICC) were also calculated. RESULTS In five different scenarios, ChatGPT-3.5 generated 11,859 hits, Bing 1,376,854, Bard 16,583, and an expert 5919 hits. We then used the first scenario to assess the relevance of the obtained results. The human expert search approach resulted in 65.22% (56/105) relevant articles. Bing was the most accurate AI-based LLM with 70.79% (63/89), followed by ChatGPT-3.5 with 21.05% (12/45), and Bard with 13.29% (42/316) relevant hits. Based on the assessment of two evaluators, ChatGPT-3.5 received the highest score (M = 48.50; SD = 0.71). Results showed a high level of agreement between the two evaluators. Although ChatGPT-3.5 showed a lower percentage of relevant hits compared to Bing, this reflects the nuanced evaluation criteria, where the subjective evaluation prioritized contextual accuracy and quality over mere relevance. CONCLUSION This study provides valuable insights into the ability of LLMs to formulate PICOT clinical questions and search strategies. AI-based LLMs, such as ChatGPT-3.5, demonstrate significant potential for augmenting clinical workflows, improving clinical query development, and supporting search strategies. However, the findings also highlight limitations that necessitate further refinement and continued human oversight. CLINICAL RELEVANCE AI could assist nurses in formulating PICOT clinical questions and search strategies. AI-based LLMs offer valuable support to healthcare professionals by improving the structure of clinical questions and enhancing search strategies, thereby significantly increasing the efficiency of information retrieval.
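For the inter-evaluator agreement step, Cronbach's alpha can be computed directly from the two raters' scores; the sketch below is a generic implementation with fabricated ratings, not the study's data or code.

```python
# Minimal sketch: Cronbach's alpha for two evaluators scoring the same set of
# chatbot-generated outputs. Ratings below are fabricated for illustration.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = rated items (questions), columns = raters treated as scale items."""
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(2)
rater1 = rng.integers(3, 6, 20).astype(float)             # hypothetical 1-5 quality scores
rater2 = np.clip(rater1 + rng.integers(-1, 2, 20), 1, 5)  # second rater, mostly agreeing
print(f"alpha = {cronbach_alpha(np.column_stack([rater1, rater2])):.2f}")
```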
Collapse
Affiliation(s)
- Lucija Gosak
- Faculty of Health Sciences, University of Maribor, Maribor, Slovenia
| | - Gregor Štiglic
- Faculty of Health Sciences, University of Maribor, Maribor, Slovenia
- Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
- Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Lisiane Pruinelli
- College of Nursing and College of Medicine, University of Florida, Gainesville, Florida, USA
| | | |
Collapse
|
39
|
Pandya S, Bresler TE, Wilson T, Htway Z, Fujita M. Decoding the NCCN Guidelines With AI: A Comparative Evaluation of ChatGPT-4.0 and Llama 2 in the Management of Thyroid Carcinoma. Am Surg 2025; 91:94-98. [PMID: 39136578 DOI: 10.1177/00031348241269430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2024]
Abstract
INTRODUCTION Artificial Intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, California) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications. OBJECTIVE This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma. PARTICIPANTS We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a Likert scale: 5) Correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analysis was conducted on Correctness (defined as scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5). RESULTS In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U-test; Mean Rank = 160.53 vs 174.47, P = 0.123), Correctness (P = 0.177), or Accuracy (P = 0.891). CONCLUSION ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.
Collapse
Affiliation(s)
- Shivam Pandya
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Tamir E Bresler
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Tyler Wilson
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Zin Htway
- Department of Laboratory, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
| | - Manabu Fujita
- Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA
- General Surgical Associates, Thousand Oaks, CA, USA
| |
Collapse
|
40
|
Gupta A, Malhotra H, Garg AK, Rangarajan K. Enhancing Radiological Reporting in Head and Neck Cancer: Converting Free-Text CT Scan Reports to Structured Reports Using Large Language Models. Indian J Radiol Imaging 2025; 35:43-49. [PMID: 39697521 PMCID: PMC11651842 DOI: 10.1055/s-0044-1788589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2024] Open
Abstract
Objective The aim of this study was to assess the efficacy of large language models (LLMs) for converting free-text computed tomography (CT) scan reports of head and neck cancer (HNCa) patients into a structured format using a predefined template. Materials and Methods A retrospective study was conducted using 150 CT reports of HNCa patients. A comprehensive structured reporting template for HNCa CT scans was developed, and the Generative Pre-trained Transformer 4 (GPT-4) was initially used to convert 50 CT reports into a structured format using this template. The generated structured reports were then evaluated by a radiologist for instances of missing or misinterpreted information and any erroneous additional details added by GPT-4. Following this assessment, the template was refined for improved accuracy. This revised template was then used for conversion of 100 other HNCa CT reports into structured format using GPT-4. These reports were then reevaluated in the same manner. Results Initially, GPT-4 successfully converted all 50 free-text reports into structured reports. However, there were 10 places with missing information: tracheostomy tube (n = 3), noninclusion of involvement of sternocleidomastoid muscle (n = 2), extranodal tumor extension (n = 3), and contiguous involvement of the neck structures by nodal mass rather than the primary (n = 2). A few instances of nonsuspicious lung nodules were misinterpreted as metastases (n = 2). GPT-4 did not indicate any erroneous additional findings. Using the revised reporting template, GPT-4 converted all the 100 CT reports into a structured format with no repeated or additional mistakes. Conclusion LLMs can be used for structuring free-text radiology reports using plain language prompts and a simple yet comprehensive reporting template. Key Points Structured radiology reports in oncological patients, although advantageous, are not used widely in practice due to perceived drawbacks like interference with routine radiology workflow and scan interpretation. We found that GPT-4 is highly efficient in converting conventional CT reports of HNCa patients to structured reports using a predefined template. This application of LLMs in radiology can help in enhancing the acceptability and clinical utility of structured radiology reports in oncological imaging. Summary Statement Large language models can successfully and accurately convert conventional radiology reports for oncology scans into a structured format using a comprehensive predefined template and thus can enhance the utility and integration of these reports in routine clinical practice.
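The general prompting pattern described here, supplying a fixed template together with the free-text report, can be sketched with the OpenAI Python client as below; the template fields, system instructions, and model name are placeholders and not the authors' actual prompt or template.

```python
# Hedged sketch of the approach (not the authors' template or prompt): ask a GPT-4-class
# model to recast a free-text HNCa CT report into a predefined structured template.
from openai import OpenAI

TEMPLATE = """Primary tumour: site, size, local extent
Nodal disease: levels, size, extranodal extension
Other findings: e.g. tracheostomy tube, lung nodules
Impression:"""  # placeholder fields, far simpler than a real reporting template

def structure_report(free_text_report: str) -> str:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Rewrite the CT report strictly into the given template. Do not "
                        "add findings that are not stated; mark absent fields as 'not mentioned'."},
            {"role": "user",
             "content": f"Template:\n{TEMPLATE}\n\nReport:\n{free_text_report}"},
        ],
    )
    return response.choices[0].message.content

# print(structure_report("Heterogeneously enhancing mass centred in the right tonsil ..."))
```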
Collapse
Affiliation(s)
- Amit Gupta
- Department of Radiodiagnosis, All India Institute of Medical Sciences New Delhi, New Delhi, India
| | - Hema Malhotra
- Department of Radiology, Dr. Bhim Rao Ambedkar Institute Rotary Cancer Hospital, All India Institute of Medical Sciences New Delhi, India
| | - Amit K. Garg
- Indian Institute of Technology, New Delhi, India
| | - Krithika Rangarajan
- Department of Radiology, Dr. Bhim Rao Ambedkar Institute Rotary Cancer Hospital, All India Institute of Medical Sciences New Delhi, India
| |
Collapse
|
41
|
Aydinbelge-Dizdar N, Dizdar K. Evaluation of the reliability and readability of chatbot responses as a patient information resource for the most common PET/CT examinations. Rev Esp Med Nucl Imagen Mol 2025; 44:500065. [PMID: 39349172 DOI: 10.1016/j.remnie.2024.500065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2024] [Revised: 09/10/2024] [Accepted: 09/16/2024] [Indexed: 10/02/2024]
Abstract
PURPOSE This study aimed to evaluate the reliability and readability of responses generated by two popular AI chatbots, 'ChatGPT-4.0' and 'Google Gemini', to potential patient questions about PET/CT scans. MATERIALS AND METHODS Thirty potential questions for each of [18F]FDG and [68Ga]Ga-DOTA-SSTR PET/CT, and twenty-nine potential questions for [68Ga]Ga-PSMA PET/CT, were asked separately to ChatGPT-4 and Gemini in May 2024. The responses were evaluated for reliability and readability using the modified DISCERN (mDISCERN) scale, Flesch Reading Ease (FRE), Gunning Fog Index (GFI), and Flesch-Kincaid Reading Grade Level (FKRGL). The inter-rater reliability of the mDISCERN scores provided by three raters (ChatGPT-4, Gemini, and a nuclear medicine physician) for the responses was assessed. RESULTS The median [min-max] mDISCERN scores reviewed by the physician for responses about FDG, PSMA and DOTA PET/CT scans were 3.5 [2-4], 3 [3-4], 3 [3-4] for ChatGPT-4 and 4 [2-5], 4 [2-5], 3.5 [3-5] for Gemini, respectively. The mDISCERN scores assessed using ChatGPT-4 for answers about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [3-5], 3 [3-4], 3 [2-3] for ChatGPT-4, and 4 [3-5], 4 [3-5], 4 [3-5] for Gemini, respectively. The mDISCERN scores evaluated using Gemini for responses about FDG, PSMA, and DOTA-SSTR PET/CTs were 3 [2-4], 2 [2-4], 3 [2-4] for ChatGPT-4, and 3 [2-5], 3 [1-5], 3 [2-5] for Gemini, respectively. The inter-rater reliability correlation coefficients of the mDISCERN scores for ChatGPT-4 responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.629 (95% CI = 0.32-0.812), 0.707 (95% CI = 0.458-0.853) and 0.738 (95% CI = 0.519-0.866), respectively (p < 0.001). The correlation coefficients of the mDISCERN scores for Gemini responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.824 (95% CI = 0.677-0.910), 0.881 (95% CI = 0.78-0.94) and 0.847 (95% CI = 0.719-0.922), respectively (p < 0.001). The mDISCERN scores assessed by ChatGPT-4, Gemini, and the physician showed that the chatbots' responses about all PET/CT scans had moderate to good statistical agreement according to the inter-rater reliability correlation coefficient (p < 0.001). There was a statistically significant difference in all readability scores (FKRGL, GFI, and FRE) of ChatGPT-4 and Gemini responses about PET/CT scans (p < 0.001). Gemini responses were shorter and had better readability scores than ChatGPT-4 responses. CONCLUSION There was an acceptable level of agreement between raters for the mDISCERN score, indicating agreement with the overall reliability of the responses. However, the information provided by AI chatbots cannot be easily read by the public.
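The three readability indices used in this study are available off the shelf; a minimal sketch with the textstat package (and a made-up chatbot answer) is shown below.

```python
# Illustrative sketch, not the study's pipeline: scoring one chatbot answer with the
# readability indices used in the paper. The sample answer is invented.
import textstat

answer = (
    "An FDG PET/CT scan combines a small amount of radioactive sugar with a CT scan "
    "to show how active different tissues are. You will be asked to fast for about six "
    "hours before the scan and to drink plenty of water afterwards."
)

print("Flesch Reading Ease  :", textstat.flesch_reading_ease(answer))   # higher = easier
print("Flesch-Kincaid Grade :", textstat.flesch_kincaid_grade(answer))  # US school grade
print("Gunning Fog Index    :", textstat.gunning_fog(answer))           # years of schooling
```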
Collapse
Affiliation(s)
- N Aydinbelge-Dizdar
- Department of Nuclear Medicine, Ankara Etlik City Hospital, Ankara, Turkiye.
| | - K Dizdar
- Department of Software Engineering, ASELSAN Inc., Ankara, Turkiye.
| |
Collapse
|
42
|
Pirkle S, Yang J, Blumberg TJ. Do ChatGPT and Gemini Provide Appropriate Recommendations for Pediatric Orthopaedic Conditions? J Pediatr Orthop 2025; 45:e66-e71. [PMID: 39171426 DOI: 10.1097/bpo.0000000000002797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 08/23/2024]
Abstract
BACKGROUND Artificial intelligence (AI), and in particular large language models (LLMs) such as Chat Generative Pre-Trained Transformer (ChatGPT) and Gemini, have provided additional resources for patients to research the management of healthcare conditions, for their own edification and for advocacy in the care of their children. The accuracy of these models, however, and the sources from which they draw conclusions, have been largely unstudied in pediatric orthopaedics. This research aimed to assess the reliability of machine learning tools in providing appropriate recommendations for the care of common pediatric orthopaedic conditions. METHODS ChatGPT and Gemini were queried using plain language generated from the American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) listed on the Pediatric Orthopedic Society of North America (POSNA) web page. Two independent reviewers assessed the accuracy of the responses, and chi-square analyses were used to compare the 2 LLMs. Inter-rater reliability was calculated via Cohen's Kappa coefficient. If research studies were cited, attempts were made to assess their legitimacy by searching the PubMed and Google Scholar databases. RESULTS ChatGPT and Gemini performed similarly, agreeing with the AAOS CPGs at rates of 67% and 69%, respectively. No significant differences were observed in the performance between the 2 LLMs. ChatGPT did not reference specific studies in any response, whereas Gemini referenced a total of 16 research papers in 6 of 24 responses. Twelve of the 16 studies referenced contained errors: they either could not be identified (7) or contained discrepancies (5) regarding publication year, journal, or proper attribution of authorship. CONCLUSION The LLMs investigated were frequently aligned with the AAOS CPGs; however, the rate of neutral statements or disagreement with consensus recommendations was substantial, and responses frequently contained errors in their citations of sources. These findings suggest there remains room for growth and transparency in the development of the models which power AI, and they may not yet represent the best source of up-to-date healthcare information for patients or providers.
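Checking whether a chatbot-cited study actually exists, as the authors did against PubMed and Google Scholar, can be partially automated with the NCBI E-utilities; the sketch below is a simplified illustration (the cited title is hypothetical), not the authors' verification workflow.

```python
# Simplified sketch: look up a chatbot-provided citation title in PubMed via E-utilities.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits(title: str) -> int:
    params = {"db": "pubmed", "term": f'"{title}"[Title]', "retmode": "json"}
    data = requests.get(ESEARCH, params=params, timeout=30).json()
    return int(data["esearchresult"]["count"])

cited_title = "Outcomes of closed reduction for pediatric supracondylar humerus fractures"  # hypothetical
print("PubMed records found:", pubmed_hits(cited_title))  # 0 hits suggests a possibly fabricated citation
```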
Collapse
Affiliation(s)
- Sean Pirkle
- Department of Orthopaedics and Sports Medicine, University of Washington
| | - JaeWon Yang
- Department of Orthopaedics and Sports Medicine, University of Washington
| | - Todd J Blumberg
- Department of Orthopaedics and Sports Medicine, University of Washington
- Department of Orthopaedics and Sports Medicine, Seattle Children's Hospital, Seattle, WA
| |
Collapse
|
43
|
Mondal H, Tiu DN, Mondal S, Dutta R, Naskar A, Podder I. Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots. J Midlife Health 2025; 16:45-50. [PMID: 40330238 PMCID: PMC12052287 DOI: 10.4103/jmh.jmh_182_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2024] [Revised: 11/22/2024] [Accepted: 12/02/2024] [Indexed: 05/08/2025] Open
Abstract
Background The use of large language model (LLM) chatbots for health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of their information persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women. Methods Twenty questions on midlife health were posed to six different LLM chatbots - ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and evaluated for accuracy, relevancy, fluency, and coherence by three independent expert physicians. An overall score was also calculated by taking the average of the four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level, to determine how easily the information could be understood by the general population. Results In terms of fluency, Perplexity scored the highest (4.3 ± 1.78), coherence was highest for Meta AI (4.26 ± 0.16), accuracy of responses was highest for Meta AI, and the relevancy score was highest for Meta AI (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (P < 0.0001). Perplexity showed the highest readability score (41.24 ± 10.57) and the lowest grade level (11.11 ± 1.93), meaning its text is the easiest to read and requires a lower level of education. Conclusion LLM chatbots can answer midlife-related health questions with variable capabilities. Meta AI was found to be the highest-scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offers high readability for accessible information. Hence, LLM chatbots can be used as educational tools for midlife health by selecting the appropriate chatbot according to its capabilities.
Collapse
Affiliation(s)
- Himel Mondal
- Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Devendra Nath Tiu
- Department of Physiology, Sheikh Bhikhari Medical College, Hazaribagh, Jharkhand, India
| | - Shaikat Mondal
- Department of Physiology, Raiganj Government Medical College and Hospital, Raiganj, West Bengal, India
| | - Rajib Dutta
- Department of Gynecology and Obstetrics, Diamond Harbour Government Medical College and Hospital, Diamond Harbour, West Bengal, India
| | - Avijit Naskar
- Department of General Medicine, Baruipur Sub-Divisional Hospital, Baruipur, West Bengal, India
| | - Indrashis Podder
- Department of Dermatology, College of Medicine and Sagore Dutta Hospital, Kolkata, West Bengal, India
| |
Collapse
|
44
|
Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B. Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis. J Med Internet Res 2024; 26:e66114. [PMID: 39729356 PMCID: PMC11724220 DOI: 10.2196/66114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2024] [Revised: 11/06/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024] Open
Abstract
BACKGROUND Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored. OBJECTIVE This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education. METHODS A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on medical exams. Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. The screening process for candidate publications was independently conducted by 2 researchers to ensure accuracy and reliability. Data, including exam information, data process information, model performance, data availability, and references, were manually curated, standardized, and organized. These curated data were integrated into the MedExamLLM platform, enabling its functionality to visualize and analyze LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement. RESULTS A total of 193 articles were included for final analysis. MedExamLLM comprised information for 16 LLMs on 198 medical exams conducted in 28 countries across 15 languages from the year 2009 to the year 2023. The United States accounted for the highest number of medical exams and related publications, with English being the dominant language used in these exams. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than other LLMs. The analysis revealed significant variability in the capabilities of LLMs across different geographic and linguistic contexts. CONCLUSIONS MedExamLLM is an open-source, freely accessible, and publicly available online platform providing comprehensive performance evaluation information and evidence knowledge about LLMs on medical exams around the world. The MedExamLLM platform serves as a valuable resource for educators, researchers, and developers in the fields of clinical medicine and artificial intelligence. By synthesizing evidence on LLM capabilities, the platform provides valuable insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts.
Collapse
Affiliation(s)
- Hui Zong
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Rongrong Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Jiaxue Cha
- Shanghai Key Laboratory of Signaling and Disease Research, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Jiao Wang
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Erman Wu
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Department of Neurosurgery, First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
| | - Jiakun Li
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- Department of Urology, West China Hospital, Sichuan University, Chengdu, China
| | - Yi Zhou
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Chi Zhang
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Weizhe Feng
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
| | - Bairong Shen
- Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China
- West China Tianfu Hospital, Sichuan University, Chengdu, China
| |
Collapse
|
45
|
Parillo M, Vaccarino F, Beomonte Zobel B, Mallio CA. ChatGPT and radiology report: potential applications and limitations. LA RADIOLOGIA MEDICA 2024; 129:1849-1863. [PMID: 39508933 DOI: 10.1007/s11547-024-01915-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/22/2024] [Accepted: 10/28/2024] [Indexed: 11/15/2024]
Abstract
Large language models like ChatGPT, with their growing accessibility, are attracting increasing interest in the medical artificial intelligence field, particularly for the analysis of radiology reports. Radiology reports present a valuable opportunity to explore the potential clinical applications of large language models, given their strong capabilities in processing and understanding written language. Early research indicates that ChatGPT could offer benefits in radiology reporting. ChatGPT can assist, but not replace, radiologists in achieving diagnoses, generating structured reports, extracting data, and identifying errors or incidental findings, and it can also serve as a support in creating patient-friendly reports. However, ChatGPT also has intrinsic limitations, such as hallucinations, stochasticity, biases, deficiencies in complex clinical scenarios, and data privacy and legal concerns. To fully utilize the potential of ChatGPT in radiology reporting, careful integration planning and rigorous validation of its outputs are crucial, especially for tasks requiring abstract reasoning or nuanced medical context. Radiologists' expertise in medical imaging and data analysis positions them exceptionally well to lead the responsible integration and utilization of ChatGPT within the field of radiology. This article offers a topical overview of the potential strengths and limitations of ChatGPT in radiological reporting.
Collapse
Affiliation(s)
- Marco Parillo
- Radiology, Multizonal Unit of Rovereto and Arco, APSS Provincia Autonoma Di Trento, Trento, Italy.
| | - Federica Vaccarino
- Radiology, Multizonal Unit of Rovereto and Arco, APSS Provincia Autonoma Di Trento, Trento, Italy
- Research Unit of Diagnostic Imaging and Interventional Radiology, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128, Rome, Italy
| | - Bruno Beomonte Zobel
- Fondazione Policlinico Universitario Campus Bio-Medico, Via Alvaro del Portillo, 200, 00128, Rome, Italy
- Research Unit of Diagnostic Imaging and Interventional Radiology, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128, Rome, Italy
| | - Carlo Augusto Mallio
- Fondazione Policlinico Universitario Campus Bio-Medico, Via Alvaro del Portillo, 200, 00128, Rome, Italy
- Research Unit of Diagnostic Imaging and Interventional Radiology, Department of Medicine and Surgery, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, 00128, Rome, Italy
| |
Collapse
|
46
|
Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesselman A, Kamaya A, Tse JR. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol (NY) 2024; 49:4286-4294. [PMID: 39088019 DOI: 10.1007/s00261-024-04501-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2024] [Revised: 07/10/2024] [Accepted: 07/13/2024] [Indexed: 08/02/2024]
Abstract
PURPOSE To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management. METHODS Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A response was considered accurate as a whole if its mean score was > 0, and reliable if the mean score was > 0 across all three responses to the same question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests. RESULTS Of the twenty questions, ChatGPT answered nine (45%), Gemini answered 12 (60%), and Bing answered six (30%) questions accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between any chatbot. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001). CONCLUSION Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
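The accuracy/reliability rule described in the methods can be written out explicitly; the sketch below is our own reading of that rule, with hypothetical physician scores rather than the study's data.

```python
# Sketch of the scoring rule as described in the abstract (not the authors' code):
# each response is scored 1 (accurate), 0 (inadequate), or -1 (inaccurate) by six raters;
# a question is "accurate" if the mean score is > 0 and "reliable" if that holds for all
# three repeated responses.
from statistics import mean

def is_accurate(physician_scores: list[int]) -> bool:
    return mean(physician_scores) > 0

def is_reliable(triplicate_scores: list[list[int]]) -> bool:
    return all(is_accurate(scores) for scores in triplicate_scores)

# Hypothetical ratings for one question asked three times, six physicians each time
triplicate = [
    [1, 1, 0, 1, 1, -1],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, -1, 1],
]
print(is_accurate(triplicate[0]), is_reliable(triplicate))  # True True
```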
Collapse
Affiliation(s)
- Jennie J Cao
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
| | - Daniel H Kwon
- Department of Medicine, San Francisco School of Medicine, University of California, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA
| | - Tara T Ghaziani
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
| | - Paul Kwo
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
| | - Gary Tse
- Department of Radiological Sciences, Los Angeles David Geffen School of Medicine, University of California, 757 Westwood Plaza Los Angeles, Los Angeles, CA, 90095, USA
| | - Andrew Kesselman
- Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive Palo Alto, Stanford, CA, 94304, USA
| | - Aya Kamaya
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
| | - Justin R Tse
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.
| |
Collapse
|
47
|
Naz R, Akacı O, Erdoğan H, Açıkgöz A. Can large language models provide accurate and quality information to parents regarding chronic kidney diseases? J Eval Clin Pract 2024; 30:1556-1564. [PMID: 38959373 DOI: 10.1111/jep.14084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/25/2024] [Accepted: 06/24/2024] [Indexed: 07/05/2024]
Abstract
RATIONALE Artificial intelligence (AI) large language models (LLMs) are tools capable of generating human-like text responses to user queries across topics. The use of these language models in various medical contexts is currently being studied. However, the performance and content quality of these language models have not been evaluated in specific medical fields. AIMS AND OBJECTIVES This study aimed to compare the performance of the AI LLMs ChatGPT, Gemini and Copilot in providing information to parents about chronic kidney diseases (CKD) and to compare the accuracy and quality of that information with a reference source. METHODS In this study, 40 frequently asked questions about CKD were identified. The accuracy and quality of the answers were evaluated with reference to the Kidney Disease: Improving Global Outcomes guidelines. The accuracy of the responses generated by the LLMs was assessed using F1, precision and recall scores. The quality of the responses was evaluated using a five-point global quality score (GQS). RESULTS ChatGPT and Gemini achieved high F1 scores of 0.89 and 1, respectively, in the diagnosis and lifestyle categories, demonstrating significant success in generating accurate responses. Furthermore, ChatGPT and Gemini were successful in generating accurate responses with high precision values in the diagnosis and lifestyle categories. In terms of recall values, all LLMs exhibited strong performance in the diagnosis, treatment and lifestyle categories. Average GQS values for the responses generated were 3.46 ± 0.55, 1.93 ± 0.63 and 2.02 ± 0.69 for Gemini, ChatGPT 3.5 and Copilot, respectively. In all categories, Gemini performed better than ChatGPT and Copilot. CONCLUSION Although LLMs provide parents with high-accuracy information about CKD, their use is limited compared with that of a reference source. The limitations in the performance of LLMs can lead to misinformation and potential misinterpretations. Therefore, patients and parents should exercise caution when using these models.
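If each chatbot answer is judged correct or incorrect against the guideline, the reported F1, precision, and recall can be computed as for any binary classifier; the sketch below uses scikit-learn with made-up labels as one way to reproduce such metrics, not the authors' procedure.

```python
# Hedged sketch with hypothetical labels: precision, recall, and F1 for chatbot answers
# judged against a guideline-based reference.
from sklearn.metrics import precision_score, recall_score, f1_score

reference = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # made-up gold labels (1 = guideline-consistent)
chatbot   = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]  # made-up judgements of the chatbot's answers

print("precision:", precision_score(reference, chatbot))
print("recall   :", recall_score(reference, chatbot))
print("F1       :", f1_score(reference, chatbot))
```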
Collapse
Affiliation(s)
- Rüya Naz
- Bursa Yüksek Ihtisas Research and Training Hospital, University of Health Sciences, Bursa, Turkey
| | - Okan Akacı
- Clinic of Pediatric Nephrology, Bursa Yüksek Ihtisas Research and Training Hospital, University of Health Sciences, Bursa, Turkey
| | - Hakan Erdoğan
- Clinic of Pediatric Nephrology, Bursa City Hospital, Bursa, Turkey
| | - Ayfer Açıkgöz
- Department of Pediatric Nursing, Faculty of Health Sciences, Eskişehir Osmangazi University, Eskişehir, Turkey
| |
Collapse
|
48
|
Patel S, Patel R. Embracing Large Language Models for Adult Life Support Learning. Cureus 2024; 16:e75961. [PMID: 39698196 PMCID: PMC11654997 DOI: 10.7759/cureus.75961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/17/2024] [Indexed: 12/20/2024] Open
Abstract
Background It is recognised that large language models (LLMs) may aid medical education by supporting the understanding of explanations behind answers to multiple-choice questions. This study aimed to evaluate the efficacy of the LLM chatbots ChatGPT and Bard in answering an Intermediate Life Support pre-course multiple-choice question (MCQ) test developed by the Resuscitation Council UK, focused on managing deteriorating patients and identifying causes of and treating cardiac arrest. We assessed the accuracy of responses and the quality of explanations to evaluate the utility of the chatbots. Methods The performance of the AI chatbots ChatGPT-3.5 and Bard was assessed on their ability to choose the correct answer and provide clear, comprehensive explanations when answering MCQs developed by the Resuscitation Council UK for their Intermediate Life Support Course. Ten MCQs were tested with a total score of 40, with one point scored for each accurate response to each statement a-d. In a separate scoring, questions were scored out of 1 if all sub-statements a-d were correct, to give a total score out of 10 for the test. The explanations provided by the AI chatbots were evaluated by three qualified physicians using a rating scale from 0-3 for each overall question, and median rater scores were calculated and compared. The Fleiss multi-rater kappa (κ) was used to determine the score agreement among the three raters. Results When scoring each overall question to give a total score out of 10, Bard outperformed ChatGPT, although the difference was not significant (p=0.37). Furthermore, there was no statistically significant difference in the performance of ChatGPT compared to Bard when scoring each sub-question separately to give a total score out of 40 (p=0.26). The quality of the explanations was similar for both LLMs. Importantly, despite answering certain questions incorrectly, both AI chatbots provided some useful correct information in their explanations of the answers to these questions. The Fleiss multi-rater kappa was 0.899 (p<0.001) for ChatGPT and 0.801 (p<0.001) for Bard. Conclusions Bard and ChatGPT performed similarly in answering the MCQs, achieving similar scores. Notably, despite having access to data across the web, neither of the LLMs answered all questions accurately. This suggests that there is still learning required of AI models in medical education.
Collapse
Affiliation(s)
- Serena Patel
- General Surgery, Imperial College NHS Trust, Ilford, GBR
| | - Rohit Patel
- Oral and Maxillofacial Surgery, Kings College Hospital, London, GBR
| |
Collapse
|
49
|
Gumilar KE, Indraprasta BR, Faridzi AS, Wibowo BM, Herlambang A, Rahestyningtyas E, Irawan B, Tambunan Z, Bustomi AF, Brahmantara BN, Yu ZY, Hsu YC, Pramuditya H, Putra VGE, Nugroho H, Mulawardhana P, Tjokroprawiro BA, Hedianto T, Ibrahim IH, Huang J, Li D, Lu CH, Yang JY, Liao LN, Tan M. Assessment of Large Language Models (LLMs) in decision-making support for gynecologic oncology. Comput Struct Biotechnol J 2024; 23:4019-4026. [PMID: 39610903 PMCID: PMC11603009 DOI: 10.1016/j.csbj.2024.10.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2024] [Revised: 10/30/2024] [Accepted: 10/30/2024] [Indexed: 11/30/2024] Open
Abstract
Objective This study investigated the ability of Large Language Models (LLMs) to provide accurate and consistent answers by focusing on their performance in complex gynecologic cancer cases. Background LLMs are advancing rapidly and require a thorough evaluation to ensure that they can be safely and effectively used in clinical decision-making. Such evaluations are essential for confirming LLM reliability and accuracy in supporting medical professionals in casework. Study design We assessed three prominent LLMs-ChatGPT-4 (CG-4), Gemini Advanced (GemAdv), and Copilot-evaluating their accuracy, consistency, and overall performance. Fifteen clinical vignettes of varying difficulty and five open-ended questions based on real patient cases were used. The responses were coded, randomized, and evaluated blindly by six expert gynecologic oncologists using a 5-point Likert scale for relevance, clarity, depth, focus, and coherence. Results GemAdv demonstrated superior accuracy (81.87 %) compared to both CG-4 (61.60 %) and Copilot (70.67 %) across all difficulty levels. GemAdv consistently provided correct answers more frequently (>60 % every day during the testing period). Although CG-4 showed a slight advantage in adhering to the National Comprehensive Cancer Network (NCCN) treatment guidelines, GemAdv excelled in the depth and focus of the answers provided, which are crucial aspects of clinical decision-making. Conclusion LLMs, especially GemAdv, show potential in supporting clinical practice by providing accurate, consistent, and relevant information for gynecologic cancer. However, further refinement is needed for more complex scenarios. This study highlights the promise of LLMs in gynecologic oncology, emphasizing the need for ongoing development and rigorous evaluation to maximize their clinical utility and reliability.
Collapse
Affiliation(s)
- Khanisyah Erza Gumilar
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
- Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Birama R. Indraprasta
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Ach Salman Faridzi
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Bagus M. Wibowo
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Aditya Herlambang
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Eccita Rahestyningtyas
- Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Budi Irawan
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Zulkarnain Tambunan
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Ahmad Fadhli Bustomi
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Bagus Ngurah Brahmantara
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Zih-Ying Yu
- Department of Public Health, China Medical University, Taichung, Taiwan
| | - Yu-Cheng Hsu
- Department of Public Health, China Medical University, Taichung, Taiwan
- School of Chinese Medicine, China Medical University, Taichung, Taiwan
| | - Herlangga Pramuditya
- Department of Obstetrics and Gynecology, Dr. Ramelan Naval Hospital, Surabaya, Indonesia
| | - Very Great E. Putra
- Department of Obstetrics and Gynecology, Dr. Kariadi Central General Hospital, Semarang, Indonesia
| | - Hari Nugroho
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Pungky Mulawardhana
- Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Brahmana A. Tjokroprawiro
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital - Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Tri Hedianto
- Faculty of Medicine and Health, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
| | - Ibrahim H. Ibrahim
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
| | - Jingshan Huang
- School of Computing, College of Medicine, University of South Alabama, Mobile, AL, USA
| | - Dongqi Li
- School of Information and Computer Sciences, School of Social and Behavioral Sciences, University of California, Irvine, CA, USA
| | - Chien-Hsing Lu
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, Taichung, Taiwan
| | - Jer-Yen Yang
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
| | - Li-Na Liao
- Department of Public Health, China Medical University, Taichung, Taiwan
| | - Ming Tan
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
- Institute of Biochemistry and Molecular Biology and Research Center for Cancer Biology, China Medical University, Taichung, Taiwan
| |
Collapse
|
50
|
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Collapse
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
| | - Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
| | - Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
| |
Collapse
|