1. Christou CD, Sitsiani O, Boutos P, Katsanos G, Papadakis G, Tefas A, Papalois V, Tsoulfas G. Comparison of ChatGPT-3.5 and GPT-4 as potential tools in artificial intelligence-assisted clinical practice in renal and liver transplantation. World J Transplant 2025;15:103536. DOI: 10.5500/wjt.v15.i3.103536.
Abstract
BACKGROUND Kidney and liver transplantation are two sub-specialized medical disciplines, with transplant professionals spending decades in training. While artificial intelligence-based (AI-based) tools could potentially assist in everyday clinical practice, comparative assessment of their effectiveness in clinical decision-making remains limited.
AIM To compare the use of ChatGPT and GPT-4 as potential tools in AI-assisted clinical practice in these challenging disciplines.
METHODS In total, 400 questions tested the knowledge and decision-making capacity of ChatGPT and GPT-4 across various renal and liver transplantation concepts. Specifically, 294 multiple-choice questions were derived from open-access sources, 63 questions were derived from published open-access case reports, and 43 from unpublished cases of patients treated at our department. The evaluation covered a wide range of topics, including clinical predictors, treatment options, and diagnostic criteria, among others.
RESULTS ChatGPT correctly answered 50.3% of the 294 multiple-choice questions, while GPT-4 demonstrated a higher performance, answering 70.7% of questions (P < 0.001). Regarding the 63 questions from published cases, ChatGPT achieved an agreement rate of 50.79% and partial agreement of 17.46%, while GPT-4 demonstrated an agreement rate of 80.95% and partial agreement of 9.52% (P = 0.01). Regarding the 43 questions from unpublished cases, ChatGPT demonstrated an agreement rate of 53.49% and partial agreement of 23.26%, while GPT-4 demonstrated an agreement rate of 72.09% and partial agreement of 6.98% (P = 0.004). When factoring by the nature of the task for all cases, notably, GPT-4 demonstrated outstanding performance, providing a differential diagnosis that included the final diagnosis in 90% of the cases (P = 0.008), and successfully predicting the prognosis of the patient in 100% of related questions (P < 0.001).
CONCLUSION GPT-4 consistently provided more accurate and reliable clinical recommendations with higher percentages of full agreements both in renal and liver transplantation compared with ChatGPT. Our findings support the potential utility of AI models like ChatGPT and GPT-4 in AI-assisted clinical practice as sources of accurate, individualized medical information and facilitating decision-making. The progression and refinement of such AI-based tools could reshape the future of clinical practice, making their early adoption and adaptation by physicians a necessity.
Affiliations
- Chrysanthos D Christou: Center for Research and Innovation in Solid Organ Transplantation, School of Medicine, Aristotle University of Thessaloniki, Thessaloniki 54622, Greece
- Olga Sitsiani: School of Medicine, Aristotle University of Thessaloniki, Thessaloniki 54622, Greece
- Panagiotis Boutos: School of Medicine, Aristotle University of Thessaloniki, Thessaloniki 54622, Greece
- Georgios Katsanos: Center for Research and Innovation in Solid Organ Transplantation, School of Medicine, Aristotle University of Thessaloniki, Thessaloniki 54622, Greece
- Georgios Papadakis: Department of Nephrology and Transplantation, Guy's Hospital, Guy's and St Thomas' NHS Foundation Trust, London SE1 1UL, United Kingdom
- Anastasios Tefas: Computational Intelligence and Deep Learning Group, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54636, Greece
- Vassilios Papalois: Renal and Transplant Unit, Hammersmith Hospital, Imperial College Healthcare NHS Trust, London W12 0HS, United Kingdom
- Georgios Tsoulfas: Center for Research and Innovation in Solid Organ Transplantation, School of Medicine, Aristotle University of Thessaloniki, Thessaloniki 54622, Greece
2. Rothchild E, Jung G, Aiello C, Tanna N, Ricci JA. Advancing emergency upper extremity care: A pilot study of ChatGPT's potential role in diagnosing and managing hand and wrist trauma. J Hand Microsurg 2025;17:100260. PMID: 40352662; PMCID: PMC12059322; DOI: 10.1016/j.jham.2025.100260.
Abstract
Purpose Hand and wrist trauma is a frequent cause of emergency room (ER) visits. However, hospitals often lack immediate hand specialist coverage. This study aims to evaluate the efficacy of Artificial Intelligence (AI) platforms like ChatGPT in aiding in the diagnosis and patient management of upper extremity trauma. Methods Ten clinical vignettes depicting common hand and wrist emergency clinical situations were created by the senior author to represent a broad range of common upper extremity injuries. These were presented to plastic surgery residents and ChatGPT (version 4.0). The responder was tasked to provide a diagnosis, ER management, and definitive treatment plans for each vignette. Responses were collected and scored by two attending plastic surgeons, blinded to the source, on a scale of 0 (poor) to 30 (excellent). Univariate and linear regression models were utilized for analysis. Results A total of 16 resident responses (9 junior and 7 senior) and 16 ChatGPT responses were collected for each of the 10 clinical scenarios. ChatGPT had significantly higher total average scores (mean = 26.6 vs. 22.7, p < 0.05) and ER management scores (mean = 9.9 vs. 6.7, p < 0.05) when compared to residents. We did not find any notable differences in diagnosis or definitive treatment scores between residents and ChatGPT responses. However, the study was not sufficiently powered to detect smaller effect sizes in these areas. No apparent correlations between scores and resident year of training were observed. Conclusions ChatGPT provided clinically accurate diagnosis and management plans for upper extremity trauma. Implementing AI in trauma management has the potential to improve the management of hand and wrist trauma in emergency settings by serving as a diagnostic and clinical reference tool for emergency medical providers. However, their integration into clinical practice should be carefully evaluated and focused on complementing, and not replacing, traditional consults. Ultimately, these tools could alleviate the burden placed on ERs and limit reliance on hand consults.
Affiliations
- Evan Rothchild: Division of Plastic Surgery, Montefiore Medical Center/Albert Einstein College of Medicine, New York, NY, USA
- Geena Jung: Division of Plastic Surgery, Montefiore Medical Center/Albert Einstein College of Medicine, New York, NY, USA
- Neil Tanna: Division of Plastic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Great Neck, NY, USA
- Joseph A. Ricci: Division of Plastic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Great Neck, NY, USA
3. Do V, Donohoe KL, Peddi AN, Carr E, Kim C, Mele V, Patel D, Crawford AN. Artificial intelligence (AI) performance on pharmacy skills laboratory course assignments. Curr Pharm Teach Learn 2025;17:102367. PMID: 40273883; DOI: 10.1016/j.cptl.2025.102367.
Abstract
OBJECTIVE To compare pharmacy student scores to scores of artificial intelligence (AI)-generated results of three common platforms on pharmacy skills laboratory assignments. METHODS Pharmacy skills laboratory course assignments were completed by four fourth-year pharmacy student investigators with three free AI platforms: ChatGPT, Copilot, and Gemini. Assignments evaluated were calculations, patient case vignettes, in-depth patient cases, drug information questions, and a reflection activity. Course coordinators graded the AI-generated submissions. Descriptive statistics were utilized to summarize AI scores and compare averages to recent pharmacy student cohorts. Interrater reliability for the four student investigators completing the assignments was assessed. RESULTS Fourteen skills laboratory assignments were completed utilizing three different AI platforms (ChatGPT, Copilot, and Gemini) by four fourth-year student investigators (n = 168 AI-generated submissions). Copilot was unable to complete 12; therefore, 156 AI-generated submissions were graded by the faculty course coordinators for accuracy and scored from 0 to 100 %. Pharmacy student cohort scores were higher than the average AI scores for all of the skills laboratory assignments except for two in-depth patient cases completed with ChatGPT. CONCLUSION Pharmacy students on average performed better on most skills laboratory assignments than three commonly used artificial intelligence platforms. Teaching students the strengths and weaknesses of utilizing AI in the classroom is essential.
Affiliations
- Vivian Do: Class of 2025, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Krista L Donohoe: BCPS, BCGP, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Apryl N Peddi: BCACP, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Eleanor Carr: Class of 2025, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Christina Kim: Class of 2025, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Virginia Mele: Class of 2025, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Dhruv Patel: Class of 2025, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
- Alexis N Crawford: BCCCP, BCPS, Virginia Commonwealth University School of Pharmacy, Richmond, VA, United States of America
4. Aubrey JM, Liefeld HR, Sharrak A, Kolbeinsson HM, Yang A, Wright GP, Chung M. Artificial intelligence generated personal statements are difficult to distinguish from human personal statements by general surgery program directors. Am J Surg 2025;245:116255. PMID: 39984330; DOI: 10.1016/j.amjsurg.2025.116255.
Affiliations
- Jason M Aubrey: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA
- Hannah R Liefeld: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA
- Aryana Sharrak: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA
- Hordur M Kolbeinsson: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA
- Amanda Yang: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA; Section of Acute Care Surgery, Corewell Health West, Grand Rapids, MI, USA
- G Paul Wright: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA; Section of Surgical Oncology, Corewell Health West, Grand Rapids, MI, USA
- Mathew Chung: General Surgery Residency, Corewell Health West, Grand Rapids, MI, USA; Michigan State University, College of Human Medicine, Grand Rapids, MI, USA; Section of Surgical Oncology, Corewell Health West, Grand Rapids, MI, USA
5. de Vries P, Baud D, Baggio S, Ceulemans M, Favre G, Gerbier E, Legardeur H, Maisonneuve E, Pena-Reyes C, Pomar L, Winterfeld U, Panchaud A. Enhancing perinatal health patient information through ChatGPT - An accuracy study. PEC Innov 2025;6:100381. PMID: 40028463; PMCID: PMC11872132; DOI: 10.1016/j.pecinn.2025.100381.
Abstract
Objectives To evaluate ChatGPT's accuracy as an information source for women and maternity-care workers on "nutrition" and "red flags" in pregnancy. Methods The accuracy of ChatGPT-generated recommendations was assessed on a 5-point Likert scale by eight raters for ten indicators per topic in four languages (French, English, German and Dutch). Accuracy and interrater agreement were calculated per topic and language. Results For both topics, median accuracy scores of ChatGPT-generated recommendations were excellent (5.0; IQR 4-5) regardless of language. Median accuracy scores varied by up to 1 point on the 5-point Likert scale depending on how the question was framed. Overall accuracy scores were 83%-89% for 'nutrition in pregnancy' versus 96%-98% for 'red flags in pregnancy'. Inter-rater agreement was good to excellent for both topics. Conclusion Although ChatGPT generated accurate recommendations regarding the tested indicators for nutrition and red flags during pregnancy, women should be aware of ChatGPT's limitations, such as inconsistencies related to question formulation, language, and the woman's personal context. Innovation Despite growing interest in the potential use of artificial intelligence in healthcare, this is, to the best of our knowledge, the first study assessing potential limitations that may affect the accuracy of ChatGPT-generated recommendations, such as language and question framing, in key domains of perinatal health.
Affiliations
- P.L.M. de Vries: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- D. Baud: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- S. Baggio: Institute of Primary Health Care (BIHAM), University of Bern, Bern, Switzerland; Laboratory of Population Health (#PopHealthLab), University of Fribourg, Fribourg, Switzerland
- M. Ceulemans: Clinical Pharmacology and Pharmacotherapy, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, 3000 Leuven, Belgium; L-C&Y, Child and Youth Institute, KU Leuven, 3000 Leuven, Belgium; Department for Health Evidence, Radboud University Medical Center, 6500 HB Nijmegen, the Netherlands
- G. Favre: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- E. Gerbier: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- H. Legardeur: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- E. Maisonneuve: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland; Institute of Primary Health Care (BIHAM), University of Bern, Bern, Switzerland; Graduate School for Health Sciences (GHS), University of Bern, Bern, Switzerland
- C. Pena-Reyes: Institute for Information and Communication Technologies (IICT), School of Engineering and Management Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland; Computational Intelligence for Computational Biology (CI4CB), Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- L. Pomar: Department of Gynecology and Obstetrics, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland; School of Health Sciences (HESAV), University of Applied Sciences and Arts Western Switzerland, Lausanne, Switzerland
- U. Winterfeld: Swiss Teratogen Information Service and Clinical Pharmacology Service, Centre Hospitalier Universitaire Vaudois (CHUV) and University of Lausanne, Lausanne, Switzerland
- A. Panchaud: Institute of Primary Health Care (BIHAM), University of Bern, Bern, Switzerland; Service of Pharmacy, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
6. Restrepo-Rodas G, Barajas-Gamboa JS, Ortiz Aparicio FM, Pantoja JP, Abril C, Al-Baqain S, Rodriguez J, Guerron AD. The Role of AI in Modern Hernia Surgery: A Review and Practical Insights. Surg Innov 2025;32:301-311. PMID: 40104921; DOI: 10.1177/15533506251328481.
Abstract
Background Artificial intelligence (AI) is revolutionizing various aspects of health care, particularly in the surgical field, where it offers significant potential for improving surgical risk assessment, predictive analytics, and research advancement. Despite the development of numerous AI models in surgery, there remains a notable gap in understanding their specific application within the context of hernia surgery. Purpose This review aims to explore the evolution of AI utilization in hernia surgery over the past 2 decades, focusing on the contributions of Machine Learning (ML), Natural Language Processing (NLP), Computer Vision (CV), and Robotics. Results We discuss how these AI fields enhance surgical outcomes and advance research in the domain of hernia surgery. ML focuses on developing and training prediction models, while NLP enables seamless human-computer interaction through the use of Large Language Models (LLMs). CV assists in critical view detection, which is crucial in procedures such as inguinal hernia repair, and robotics improves minimally invasive techniques, dexterity, and precision. We examine recent evidence and the applicability of various AI models on hernia patients, considering the strengths, limitations, and future possibilities within each field. Conclusion By consolidating the impact of AI models on hernia surgery, this review provides insights into the potential of AI for advancing patient care and surgical techniques in this field, ultimately contributing to the ongoing evolution of surgical practice.
Affiliations
- All authors (Gabriela Restrepo-Rodas, Juan S Barajas-Gamboa, Freddy Miguel Ortiz Aparicio, Juan Pablo Pantoja, Carlos Abril, Suleiman Al-Baqain, John Rodriguez, Alfredo D Guerron): Hernia and Core Health Center, Department of General Surgery, Digestive Disease Institute, Abu Dhabi, United Arab Emirates
7. Hu X, Xu D, Zhang H, Tang M, Gao Q. Comparative diagnostic accuracy of ChatGPT-4 and machine learning in differentiating spinal tuberculosis and spinal tumors. Spine J 2025;25:1196-1205. PMID: 39805470; DOI: 10.1016/j.spinee.2024.12.035.
Abstract
BACKGROUND In clinical practice, distinguishing between spinal tuberculosis (STB) and spinal tumors (ST) poses a significant diagnostic challenge. The application of AI-driven large language models (LLMs) shows great potential for improving the accuracy of this differential diagnosis. PURPOSE To evaluate the performance of various machine learning models and ChatGPT-4 in distinguishing between STB and ST. STUDY DESIGN A retrospective cohort study. PATIENT SAMPLE A total of 143 STB cases and 153 ST cases admitted to Xiangya Hospital Central South University, from January 2016 to June 2023 were collected. OUTCOME MEASURES This study incorporates basic patient information, standard laboratory results, serum tumor markers, and comprehensive imaging records, including Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), for individuals diagnosed with STB and ST. Machine learning techniques and ChatGPT-4 were utilized to distinguish between STB and ST separately. METHOD Six distinct machine learning models, along with ChatGPT-4, were employed to evaluate their differential diagnostic effectiveness. RESULT Among the 6 machine learning models, the Gradient Boosting Machine (GBM) algorithm model demonstrated the highest differential diagnostic efficiency. In the training cohort, the GBM model achieved a sensitivity of 98.84% and a specificity of 100.00% in distinguishing STB from ST. In the testing cohort, its sensitivity was 98.25%, and specificity was 91.80%. ChatGPT-4 exhibited a sensitivity of 70.37% and a specificity of 90.65% for differential diagnosis. In single-question cases, ChatGPT-4's sensitivity and specificity were 71.67% and 92.55%, respectively, while in re-questioning cases, they were 44.44% and 76.92%. CONCLUSION The GBM model demonstrates significant value in the differential diagnosis of STB and ST, whereas the diagnostic performance of ChatGPT-4 remains suboptimal.
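The GBM workflow summarized above, training a boosted classifier on clinical, laboratory, and imaging features and reporting sensitivity and specificity on a held-out cohort, can be illustrated with a short, self-contained sketch. This is not the authors' code: the features, split, and hyperparameters are placeholder assumptions, and synthetic data stands in for the real patient cohort.

```python
# Illustrative sketch only: a gradient-boosting classifier with sensitivity/specificity
# reporting, loosely mirroring the STB-vs-ST workflow described in the abstract.
# Synthetic data replaces the real clinical/laboratory/imaging features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Placeholder cohort: 296 "patients", 20 numeric features, binary label (1 = STB, 0 = ST).
X, y = make_classification(n_samples=296, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0)  # hyperparameters are assumptions
model.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
sensitivity = tp / (tp + fn)  # true-positive rate for the positive (STB) class
specificity = tn / (tn + fp)  # true-negative rate
print(f"sensitivity={sensitivity:.2%}, specificity={specificity:.2%}")
```

On real data, the feature matrix would come from the structured clinical records rather than make_classification, and performance would be reported separately for training and testing cohorts as in the abstract.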
Affiliations
- Xiaojiang Hu: Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; Department of Orthopedics, The Second Xiangya Hospital of Central South University, Changsha 410011, Hunan, China
- Dongcheng Xu: Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; Department of Spine Surgery, The Third Xiangya Hospital, Central South University, Changsha 410013, Hunan, China
- Hongqi Zhang: Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China
- Mingxing Tang: Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China
- Qile Gao: Department of Spine Surgery and Orthopaedics, Xiangya Hospital, Central South University, Changsha 410008, China; National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha 410008, China
8. Magruder ML, Miskiewicz M, Rodriguez AN, Ng M, Abdelgawad A. Comparison of ChatGPT Plus (version 4.0) and pretrained AI model (Orthopod) on orthopaedic in-training exam (OITE). Surgeon 2025;23:187-191. PMID: 40263060; DOI: 10.1016/j.surge.2025.04.004.
Abstract
INTRODUCTION Recent advancements in large language model (LLM) artificial intelligence (AI) systems, like ChatGPT, have showcased the ability to answer standardized examination questions, but their performance is variable. The goal of this study was to compare the performance of standard ChatGPT-4 with a custom-trained ChatGPT model taking the Orthopaedic Surgery In-Training Examination (OITE). METHODS Practice questions for the 2022 OITE, made available on the AAOS-ResStudy website (aaos.org/education/examinations/ResStudy), were used for this study. Question stems were uploaded to both standard ChatGPT-4 and the custom-trained ChatGPT model (Orthopod), and the responses were documented as correct or incorrect. For questions containing media elements, screenshots were converted to PNG files and uploaded to ChatGPT. Evaluation of the AI's performance included descriptive statistics to determine the percent of questions answered correctly or incorrectly. RESULTS Two hundred and seven questions were analyzed with both ChatGPT 4.0 and Orthopod. ChatGPT correctly answered 73.43% (152/207) of the questions, while Orthopod correctly answered 71.01% (147/207). There was no significant difference in the performance of either language model based on inclusion of media or question category. CONCLUSION ChatGPT 4.0 and Orthopod answered 73.43% and 71.01% of OITE practice questions correctly, respectively. Both systems provided well-reasoned answers in response to multiple-choice questions. The thoughtfully articulated responses and well-supported explanations offered by both systems may prove to be a valuable educational resource for orthopedic residents as they prepare for upcoming board-style exams. LEVEL OF EVIDENCE IV.
Affiliations
- Matthew L Magruder: Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Michael Miskiewicz: Renaissance School of Medicine at Stony Brook University Medical Center, Stony Brook, NY, USA
- Ariel N Rodriguez: Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Mitchell Ng: Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Amr Abdelgawad: Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
9. Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, Kleesiek J, Sushil M, Adams LC, Bressem KK. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J Am Med Inform Assoc 2025;32:1015-1024. PMID: 40190132; DOI: 10.1093/jamia/ocaf045.
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks. MATERIALS AND METHODS We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities. RESULTS Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate. DISCUSSION Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation. CONCLUSION Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.
Affiliations
- Felix J Dorfner: Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin 10117, Germany; Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, United States
- Amin Dada: Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany
- Felix Busch: Department of Radiology, Klinikum rechts der Isar, Technical University Munich, Munich 81675, Germany
- Marcus R Makowski: Department of Radiology, Klinikum rechts der Isar, Technical University Munich, Munich 81675, Germany
- Tianyu Han: Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
- Daniel Truhn: Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen 52074, Germany
- Jens Kleesiek: Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen 45131, Germany; Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (AöR), Essen 45147, Germany; German Cancer Consortium (DKTK, Partner Site Essen), Heidelberg, Germany; Department of Physics, TU Dortmund, Dortmund 44227, Germany
- Madhumita Sushil: Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Lisa C Adams: Department of Radiology, Klinikum rechts der Isar, Technical University Munich, Munich 81675, Germany
- Keno K Bressem: Department of Radiology, Klinikum rechts der Isar, Technical University Munich, Munich 81675, Germany; German Heart Center Munich, Technical University Munich, Munich 80636, Germany
10. Sood A, Moyer A, Jahangiri P, Mar D, Nitichaikulvatana P, Ramreddy N, Stolyar L, Lin J. Evaluation of the Reliability of ChatGPT to Provide Guidance on Recombinant Zoster Vaccination for Patients With Rheumatic and Musculoskeletal Diseases. J Clin Rheumatol 2025;31:156-161. PMID: 39814338; DOI: 10.1097/rhu.0000000000002198.
Abstract
INTRODUCTION Large language models (LLMs) such as ChatGPT can potentially transform the delivery of health information. This study aims to evaluate the accuracy and completeness of ChatGPT in responding to questions on recombinant zoster vaccination (RZV) in patients with rheumatic and musculoskeletal diseases. METHODS A cross-sectional study was conducted using 20 prompts based on information from the Centers for Disease Control and Prevention (CDC), the Advisory Committee on Immunization Practices (ACIP), and the American College of Rheumatology (ACR). These prompts were inputted into ChatGPT 3.5. Five rheumatologists independently scored the ChatGPT responses for accuracy (Likert 1 to 5) and completeness (Likert 1 to 3) compared with validated information sources (CDC, ACIP, and ACR). RESULTS The overall mean accuracy of ChatGPT responses on a 5-point scale was 4.04, with 80% of responses scoring ≥4. The mean completeness score of ChatGPT response on a 3-point scale was 2.3, with 95% of responses scoring ≥2. Among the 5 raters, ChatGPT unanimously scored with high accuracy and completeness to various patient and physician questions surrounding RZV. There was one instance where it scored with low accuracy and completeness. Although not significantly different, ChatGPT demonstrated the highest accuracy and completeness in answering questions related to ACIP guidelines compared with other information sources. CONCLUSIONS ChatGPT exhibits promising ability to address specific queries regarding RZV for rheumatic and musculoskeletal disease patients. However, it is essential to approach ChatGPT with caution due to risk of misinformation. This study emphasizes the importance of rigorously validating LLMs as a health information source.
Affiliations
- Akhil Sood: Division of Immunology and Rheumatology, Stanford University School of Medicine, Palo Alto, CA
11. Torre D, Schuwirth L. Programmatic assessment for learning: A programmatically designed assessment for the purpose of learning: AMEE Guide No. 174. Med Teach 2025;47:918-933. PMID: 39368061; DOI: 10.1080/0142159x.2024.2409936.
Abstract
Programmatic assessment for learning (PAL) involves programmatically structured collection of assessment data for the purpose of learning. In this guide, we examine and provide recommendations on several aspects: First, we review the evolution that has led to the development of programmatic assessment, providing clarification of some of its terminology. Second, we outline the learning processes that guide the design of PAL, including distributed learning, interleaving, overlearning, and test-enhanced learning. Third, we review the evolving nature of validity and provide insights into validity from a program perspective. Finally, we examine opportunities, challenges, and future directions of assessment in the context of artificial intelligence.
Affiliations
- Dario Torre: University of Central Florida College of Medicine, Orlando, FL, USA
- Lambert Schuwirth: College of Medicine and Public Health, Flinders University, Adelaide, Australia
12. Rossi NA, Corona KK, Yoshiyasu Y, Hajiyev Y, Hughes CA, Pine HS. Comparative analysis of GPT-4 and Google Gemini's consistency with pediatric otolaryngology guidelines. Int J Pediatr Otorhinolaryngol 2025;193:112336. PMID: 40203537; DOI: 10.1016/j.ijporl.2025.112336.
Abstract
OBJECTIVE To evaluate the accuracy and completeness of large language models (LLMs) in interpreting pediatric otolaryngology guidelines. MATERIALS AND METHODS GPT-4 and Google Gemini were assessed on their responses to queries based on key action statements from three American Academy of Otolaryngology - Head and Neck Surgery Foundation (AAO-HNSF) clinical practice guidelines. Two independent reviewers evaluated responses using Likert scales for accuracy (1-5) and completeness (1-3). Inter-rater reliability was assessed with weighted Cohen's kappa. Statistical comparisons between models were performed using the Wilcoxon Signed-Rank Test. RESULTS Both models achieved high scores (GPT-4: accuracy 4.74, completeness 2.94; Google Gemini: accuracy 4.82, completeness 2.98). No significant difference was found in accuracy (p = 0.134), while completeness showed concordance (p = 0.34). AI responses often emphasized the importance of individualization and consulting healthcare professionals. CONCLUSION GPT-4 and Google Gemini demonstrated potential as assistive tools in pediatric otolaryngology. However, limitations exist, including pre-trained datasets and subjective evaluation methods. Continuous learning and model refinement are crucial for reliable clinical integration. AI should complement, not replace, human expertise. This study contributes to the exploration of LLMs in pediatric otolaryngology.
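For readers unfamiliar with the two statistics named in this abstract, the sketch below shows how a weighted Cohen's kappa (inter-rater reliability on ordinal Likert scores) and a Wilcoxon signed-rank test (paired comparison of two models scored on the same items) are commonly computed in Python. The scores are invented placeholders, not data from the study, and the weighting choice is an assumption.

```python
# Illustrative sketch: weighted Cohen's kappa for two raters' ordinal Likert scores,
# plus a Wilcoxon signed-rank test comparing two models scored on the same items.
# All numbers below are made-up placeholders, not data from the study.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import wilcoxon

rater1 = [5, 4, 5, 5, 3, 4, 5, 5, 4, 5]   # accuracy scores (1-5) from reviewer 1
rater2 = [5, 4, 4, 5, 3, 5, 5, 5, 4, 5]   # accuracy scores (1-5) from reviewer 2
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")  # weighted for ordinal data

model_a_scores = [5, 4, 5, 5, 4, 5, 5, 4, 5, 5]  # per-statement accuracy, model A
model_b_scores = [5, 5, 5, 4, 4, 5, 5, 5, 5, 5]  # per-statement accuracy, model B
stat, p_value = wilcoxon(model_a_scores, model_b_scores)  # paired, non-parametric test

print(f"weighted kappa={kappa:.2f}, Wilcoxon p={p_value:.3f}")
```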
Affiliations
- Nicholas A Rossi: Department of Otolaryngology, Nationwide Children's Hospital, Columbus, OH, USA
- Kassandra K Corona: Department of Otolaryngology, University of Texas Health Science Center at San Antonio, San Antonio, TX, USA
- Yuki Yoshiyasu: Department of Otolaryngology, University of Texas Medical Branch, Galveston, TX, USA
- Yusif Hajiyev: Department of Otolaryngology, University of Texas Medical Branch, Galveston, TX, USA
- Charles A Hughes: Department of Otolaryngology, University of Texas Medical Branch, Galveston, TX, USA
- Harold S Pine: Department of Otolaryngology, University of Texas Medical Branch, Galveston, TX, USA
13. von Bubnoff F, Werner J, Hebach NR, Chopra DA, Schubert MC. Medical Mistrust in Online Cancer Communities: A Large-Scale Analysis Across 10 Cancer Entities. Psychooncology 2025;34:e70180. PMID: 40448945; DOI: 10.1002/pon.70180.
Abstract
BACKGROUND Medical mistrust is a barrier to optimal cancer care. Analyzing social media posts where patients voice mistrust provides an opportunity to understand its variations and derive potential ways to address medical mistrust. AIMS To (1) identify the frequency of mistrust expression in cancer-related Reddit posts, (2) characterize mistrusted entities and reasons for mistrust, and (3) identify emotional tone associated with mistrust. METHODS 101,963 posts from 10 entity-specific cancer communities on the social media platform Reddit made before September 30, 2024, were analyzed using a Large Language Model (LLM, "gpt-4o-mini") in this cross-sectional study. Performance of the LLM was compared to human raters. Categories for mistrusted entities and reasons for mistrust were developed inductively by human evaluators. Subsequently, posts were assigned to these different categories by the LLM. RESULTS Of n = 101,963 posts analyzed, 19,159 posts (18.8%) were categorized as expressing mistrust, predominantly directed at healthcare professionals (n = 14,221, 74.2%). Most common reasons for mistrust were "disregard for patient concerns" (n = 8176, 42.7%), "perceived incompetence of medical management" (n = 4871, 25.4%), and problems in "communication" (n = 4060, 21.2%). Mistrust posts commonly contained "worried" (n = 5933, 31.0%), "concerned" (n = 3623, 18.9%) and "frustrated" (n = 3046, 15.9%) tones. CONCLUSIONS Expression of medical mistrust is prevalent in social media and is predominantly directed at healthcare professionals. Mistrust is frequently associated with dismissal of patients' symptoms or concerns, a perceived lack of thoroughness in clinical management and communication difficulties, suggesting these as key actionable areas to address medical mistrust in clinical practice.
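As a rough illustration of the kind of LLM-assisted labelling described above, the sketch below sends a single post to an OpenAI chat model and asks for a mistrust label. It is not the authors' pipeline: the prompt wording, category list, and example post are assumptions, and the study additionally validated the model against human raters and used inductively developed categories.

```python
# Illustrative sketch: asking an OpenAI chat model to label one post for medical mistrust.
# The prompt, categories, and example post are assumptions, not the study's actual pipeline.
from openai import OpenAI  # requires the `openai` package and an OPENAI_API_KEY

client = OpenAI()
post = "My oncologist brushed off my pain again. I'm not sure I trust this treatment plan."

prompt = (
    "Does the following post express medical mistrust? Answer 'yes' or 'no'. If yes, name "
    "the mistrusted entity (e.g., healthcare professionals, pharmaceutical industry, health "
    "system) and one reason category.\n\nPost: " + post
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the labelling as deterministic as possible
)
print(response.choices[0].message.content)
```

At scale, each of the roughly 100,000 posts would be processed this way and the free-text labels mapped onto the predefined categories before counting frequencies.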
Affiliations
- Fabian von Bubnoff: Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany; Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Johannes Werner: Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany; Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Nils R Hebach: Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany; Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Deepti A Chopra: Department of Psychiatry, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Marc Cicero Schubert: Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany; Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
14. Deng A, Chen W, Dai J, Jiang L, Chen Y, Chen Y, Jiang J, Rao M. Current application of ChatGPT in undergraduate nuclear medicine education: Taking Chongqing Medical University as an example. Med Teach 2025;47:997-1003. PMID: 39305476; DOI: 10.1080/0142159x.2024.2399673.
Abstract
BACKGROUND Nuclear Medicine (NM), as an inherently interdisciplinary field, integrates diverse scientific principles and advanced imaging techniques. The advent of ChatGPT, a large language model, opens new avenues for medical educational innovation. With its advanced natural language processing abilities and complex algorithms, ChatGPT holds the potential to substantially enrich medical education, particularly in NM. OBJECTIVE To investigate the current application of ChatGPT in undergraduate Nuclear Medicine Education (NME). METHODS Employing a mixed-methods sequential explanatory design, the research investigates the current status of NME, the use of ChatGPT, and the attitude towards ChatGPT among teachers and students in the Second Clinical College of Chongqing Medical University. RESULTS The investigation yields several salient findings: (1) Students and educators in NM face numerous challenges in the learning process; (2) ChatGPT is found to possess significant applicability and potential benefits in NME; (3) There is a pronounced inclination among respondents to adopt ChatGPT, with a keen interest in its diverse applications within the educational sphere. CONCLUSION ChatGPT has been utilized to address the difficulties faced by undergraduates at Chongqing Medical University in NME, and has been applied in various aspects to assist learning. The findings of this survey may offer some insights into how ChatGPT can be integrated into practical medical education.
Affiliations
- Ailin Deng: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China; Clinical Medicine, Chongqing Medical University, Chongqing, China
- Wenyi Chen: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China; Clinical Medicine, Chongqing Medical University, Chongqing, China
- Jinjie Dai: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China; Clinical Medicine, Chongqing Medical University, Chongqing, China
- Liu Jiang: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China; Clinical Medicine, Chongqing Medical University, Chongqing, China
- Yicai Chen: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China; Clinical Medicine, Chongqing Medical University, Chongqing, China
- Yuhua Chen: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China; Clinical Medicine, Chongqing Medical University, Chongqing, China
- Jinyan Jiang: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Maohua Rao: Department of Nuclear Medicine, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
15. McNamara MA, Hill BG, Schilling PL. The Challenges of Using ChatGPT for Clinical Decision Support in Orthopaedic Surgery: A Pilot Study. J Am Acad Orthop Surg 2025;33:618-622. PMID: 40153611; DOI: 10.5435/jaaos-d-24-01072.
Abstract
BACKGROUND Artificial intelligence (AI) technologies have recently exploded in both accessibility and applicability, including in health care. Although studies have demonstrated its ability to adequately answer simple patient issues or multiple-choice questions, its capacity for deeper complex decision making within health care is relatively untested. In this study, we aimed to delve into AI's ability to integrate multiple clinical data sources and produce a reasonable assessment and plan, specifically in the setting of an orthopaedic surgery consultant. METHODS Ten common fractures seen by orthopaedic surgeons in the emergency department were chosen. Consult notes from patients sustaining each of these fractures, seen at a level 1 academic trauma center between 2022 and 2023, were stripped of patient data. The history, physical examination, and imaging interpretations were then given to ChatGPT4 in raw and semistructured formats. The AI was asked to determine an assessment and plan as if it were an orthopaedic surgeon. The generated plans were then compared with the actual clinical course of the patient, as determined by our multispecialty trauma conference. RESULTS When given both raw and semistructured formats of clinical data, ChatGPT4 determined safe and reasonable plans that included the final clinical outcome of the patient scenario. Evaluating large language models is an ongoing field of research without an established quantitative rubric; therefore, our conclusions rely on subjective comparison. CONCLUSION When given history, physical examination, and imaging interpretations, ChatGPT is able to synthesize complex clinical data into a reasonable and most importantly safe assessment and plan for common fractures seen by orthopaedic surgeons. Evaluating large language models is an ongoing challenge; however, using actual clinical courses as a "benchmark" for comparison presents a possible avenue for further research.
Affiliations
- Michael A McNamara: Department of Orthopedic Surgery, Dartmouth-Hitchcock Medical Center, One Medical Center Drive, Lebanon, NH
16. Yip HF, Li Z, Zhang L, Lyu A. Large Language Models in Integrative Medicine: Progress, Challenges, and Opportunities. J Evid Based Med 2025;18:e70031. PMID: 40384541; PMCID: PMC12086751; DOI: 10.1111/jebm.70031.
Abstract
Integrating Traditional Chinese Medicine (TCM) and Modern Medicine faces significant barriers, including the absence of unified frameworks and standardized diagnostic criteria. While Large Language Models (LLMs) in Medicine hold transformative potential to bridge these gaps, their application in integrative medicine remains underexplored and methodologically fragmented. This review systematically examines LLMs' development, deployment, and challenges in harmonizing Modern and TCM practices while identifying actionable strategies to advance this emerging field. The review provides insight into the following aspects. First, it summarizes existing LLMs in the General Domain, Modern Medicine, and TCM from the perspective of their model structures, number of parameters, and domain-specific training data. We highlight the limitations of existing LLMs on integrative medicine tasks through benchmark experiments, as well as the unique applications of LLMs in Integrative Medicine. We discuss the challenges encountered during development and propose possible solutions to mitigate them. This review synthesizes technical insights with practical clinical considerations, providing a roadmap for leveraging LLMs to bridge TCM's empirical wisdom with modern medical systems. These AI-driven synergies could redefine personalized care, optimize therapeutic outcomes, and establish new standards for holistic healthcare innovation.
Affiliations
- Hiu Fung Yip: School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China; Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China
- Zeming Li: Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Lu Zhang: Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Aiping Lyu: School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China; Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China; Guangdong-Hong Kong-Macau Joint Lab on Chinese Medicine and Immune Disease Research, Guangzhou, China
17. Thompson RAM, Shah YB, Aguirre F, Stewart C, Lallas CD, Shah MS. Artificial Intelligence Use in Medical Education: Best Practices and Future Directions. Curr Urol Rep 2025;26:45. PMID: 40439780; PMCID: PMC12122599; DOI: 10.1007/s11934-025-01277-1.
Abstract
PURPOSE OF REVIEW This review examines the various ways artificial intelligence (AI) has been utilized in medical education (MedEd) and presents ideas that will ethically and effectively leverage AI in enhancing the learning experience of medical trainees. RECENT FINDINGS AI has improved accessibility to learning material in a manner that engages the wider population. It has utility as a reference tool and can assist academic writing by generating outlines and summaries and by identifying relevant reference articles. As AI is increasingly integrated into MedEd and practice, its regulation should become a priority to prevent drawbacks to the education of trainees. By involving physicians in AI design and development, we can best preserve the integrity, quality, and clinical relevance of AI-generated content. In adopting the best practices for AI use, we can maximize its benefits while preserving the ethical standards of MedEd with the goal of improving learning outcomes.
Affiliations
- All authors (Rasheed A M Thompson, Yash B Shah, Francisco Aguirre, Courtney Stewart, Costas D Lallas, Mihir S Shah): Department of Urology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA, United States
18. Solmonovich RL, Kouba I, Lee JY, Demertzis K, Blitz MJ. Physician awareness of, interest in, and current use of artificial intelligence large language model-based virtual assistants. PLoS One 2025;20:e0320749. PMID: 40435166; PMCID: PMC12118853; DOI: 10.1371/journal.pone.0320749.
Abstract
There is increasing medical interest and research regarding the potential of large language model-based virtual assistants in healthcare. It is important to understand physicians' interest in implementing these tools into clinical practice, so preceding education could be implemented to ensure appropriate and ethical use. We aimed to assess physician 1) awareness of, 2) interest in, and 3) current use of large language model-based virtual assistants for clinical practice and professional development and determine the specific applications of interest and use. Additionally, we wanted to determine associations with age, gender, and role. We conducted a cross-sectional study between 11/08-12/2023 via an anonymous web-based survey that was disseminated among physicians at a large NY healthcare network using snowball sampling. Descriptive and basic inferential statistics were performed. There were 562 respondents, largely males (55.7%), attending physicians (68.5%), and from nonsurgical specialties (67.4%). Most were aware of large language model chatbots (89.7%) and expressed interest (97.2%). Only a minority incorporated it into their practice (21%). Highest levels of interest were for journal review, patient education, and documentation/dictation (88.1-89.5%). The most frequently employed uses were medical information and education and study/research design. Females showed higher interest than males (99.2% vs. 95.5%, p = 0.011). Attendings were more aware of large language models (92.2% vs. 84.2%, p = 0.004), while trainees had increased rates of use (28.8% vs. 17.4%, p = 0.002). Use varied across age brackets, highest among 20-30 year olds (29.1% vs. 13.5%-23.4%, p = 0.018), except for documentation/dictation, where highest use was among the 41-50 year old group (10.5% vs. 2.6%-8.7%, p = 0.047). We concluded that physicians are interested in large language model-based virtual assistants, a minority are implementing it into their practice, and gender-, role-, and age-based disparities exist. As physicians continue to integrate large language models into their patient care and professional development, there is opportunity for research, education, and guidance to ensure an inclusive, responsible, and safe adoption.
Affiliations
- Rachel L. Solmonovich: Northwell, New Hyde Park, New York, United States of America; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, New York; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, United States of America
- Insaf Kouba: Northwell, New Hyde Park, New York, United States of America; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, New York; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, United States of America
- Ji Y. Lee: Biostatistics Unit, Office of Academic Affairs, Northwell Health, New Hyde Park, New York, United States of America
- Kristen Demertzis: Northwell, New Hyde Park, New York, United States of America; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, New York; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, United States of America
- Matthew J. Blitz: Northwell, New Hyde Park, New York, United States of America; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, New York; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, United States of America; Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, New Hyde Park, New York, United States of America
19
|
Bahar TS, Öcal O, Çetinkaya Yaprak A. Comparison of ChatGPT-4, Microsoft Copilot, and Google Gemini for Pediatric Ophthalmology Questions. J Pediatr Ophthalmol Strabismus 2025:1-7. [PMID: 40423505 DOI: 10.3928/01913913-20250404-03] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 05/28/2025]
Abstract
PURPOSE To evaluate the success of Chat Generative Pre-trained Transformer (ChatGPT; OpenAI), Google Gemini (Alphabet, Inc), and Microsoft Copilot (Microsoft Corporation) artificial intelligence (AI) programs, which are offered free of charge by three different manufacturers, in answering questions related to pediatric ophthalmology correctly and to investigate whether they are superior to each other. METHODS ChatGPT, Gemini, and Copilot were each asked 100 multiple-choice questions from the Ophtho-Questions online question bank, which is widely used for preparing for the high-stakes Ophthalmic Knowledge Evaluation Program examination. Their answers were compared to the official answer keys and categorized as correct or incorrect. The readability of the responses was assessed using the Flesch-Kincaid Grade Level, Flesch Reading Ease Score, and the Coleman-Liau Index. RESULTS ChatGPT, Gemini, and Copilot chatbots answered 61 (61%), 60 (60%), and 74 (74%) questions correctly, respectively. The Copilot AI program had a significantly higher rate of correct answers to questions than ChatGPT and Gemini (P = .049 and .035). Three readability analyses revealed that Copilot had the highest average score, followed by ChatGPT and Gemini, which were more challenging than the recommended level. CONCLUSIONS Although AI chatbots can serve as useful tools for acquiring information on pediatric ophthalmology, their responses should be interpreted with caution due to potential inaccuracies. [J Pediatr Ophthalmol Strabismus. 20XX;X(X):XXX-XXX.].
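For readers unfamiliar with the three readability indices named above, the sketch below computes them from their standard published formulas, using a crude vowel-group heuristic for syllables. The authors' tooling may segment sentences and count syllables differently, so treat this only as an illustration of what the metrics measure.

```python
# Minimal readability sketch: Flesch Reading Ease, Flesch-Kincaid Grade Level,
# and Coleman-Liau Index from their standard formulas; syllable counting here is
# a rough vowel-group approximation.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(len(w) for w in words)

    wps = n_words / sentences          # words per sentence
    spw = syllables / n_words          # syllables per word
    L = letters / n_words * 100        # letters per 100 words
    S = sentences / n_words * 100      # sentences per 100 words

    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
    }

print(readability("Amblyopia is reduced vision in one eye caused by abnormal "
                  "visual development early in life."))
```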
Collapse
|
20
|
Madi M, Araji T, Hazimeh D, Adra SW. Battle of the Bots: Assessing the Ability of Four Large Language Models to Tackle Different Surgery Topics. Am Surg 2025:31348251346538. [PMID: 40420550 DOI: 10.1177/00031348251346538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/28/2025]
Abstract
OBJECTIVE Our study aims to compare the performance of different large language model chatbots on surgical questions of different topics and categories. MATERIALS AND METHODS Four different chatbots (ChatGPT 4.0, Medical Chat, Google Bard, and Copilot AI) were used for our study. A total of 114 multiple-choice surgical questions covering 9 different topics were entered into each chatbot, and their answers were recorded. RESULTS The performance of ChatGPT was significantly better than Bard (P < 0.0001) and Medical Chat (P = 0.0013) but not significantly better than Copilot (P = 0.9663). We also found a statistically significant difference in ENT (P = 0.0199) and GI (P = 0.0124) questions between the chatbots when we assessed their performance per surgical specialty. Finally, the mean scores of Bard, Copilot, Medical Chat, and ChatGPT 4.0 on the diagnosis questions were higher than those on the management questions; the difference was statistically significant only for Bard (P = 0.0281). CONCLUSION Our study offers insight into the performance of different chatbots on surgery-related questions and topics. The strengths and shortcomings of each can provide a better understanding of how to use chatbots in the surgical field, including surgical education.
Collapse
Affiliation(s)
- M Madi
- Department of Surgery, American University of Beirut Medical Center, Beirut, Lebanon
| | - T Araji
- Department of Surgery, Henry Ford Hospital, Detroit, MI, USA
| | - D Hazimeh
- Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, NC, USA
| | - Souheil W Adra
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA, USA
| |
Collapse
|
21
|
Wang W, Fu J, Zhang Y, Hu K. A Comparative Analysis of GPT-4o and ERNIE Bot in a Chinese Radiation Oncology Exam. JOURNAL OF CANCER EDUCATION : THE OFFICIAL JOURNAL OF THE AMERICAN ASSOCIATION FOR CANCER EDUCATION 2025:10.1007/s13187-025-02652-9. [PMID: 40418520 DOI: 10.1007/s13187-025-02652-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Accepted: 05/11/2025] [Indexed: 05/27/2025]
Abstract
Large language models (LLMs) are increasingly utilized in medical education and practice, yet their application in niche fields such as radiation oncology remains underexplored. This study evaluates and compares the performance of OpenAI's GPT-4o and Baidu's ERNIE Bot in a Chinese-language radiation oncology examination. We employed the Chinese National Health Professional Technical Qualification Examination (Intermediate Level) for Radiation Oncology, using a question bank of 1128 items across four sections: Basic Knowledge, Relevant Knowledge, Specialized Knowledge, and Practice Competence. A passing score required an accuracy rate of 60% or higher in all sections. The models' responses were assessed for accuracy against standard answers, with key metrics including overall accuracy, section-specific performance, case analysis performance, and accuracy consensus between the models. The overall accuracy rates were 79.3% for GPT-4o and 76.9% for ERNIE Bot (p = 0.154). Across the four sections, GPT-4o achieved accuracy rates of 82.1%, 84.6%, 78.6%, and 60.9%, respectively, while ERNIE Bot achieved 81.6%, 73.9%, 77.9%, and 69.0%. In the Relevant Knowledge section, GPT-4o achieved significantly higher accuracy (p = 0.002), while no significant differences were found in the other three sections. Across various question types-including single-choice, multiple-answer, case analysis, non-case analysis, and different content areas of case analysis-both models exhibited satisfactory accuracy, and ERNIE Bot achieved accuracy rates that were comparable to GPT-4o. The accuracy consensus between the two models was 84.5%, significantly exceeding the individual accuracy rates of GPT-4o (p = 0.003) and ERNIE Bot (p < 0.001). Both GPT-4o and ERNIE Bot successfully passed the highly specialized Chinese-language medical examination in radiation oncology and demonstrated comparable performance. This study provides valuable insights into the application of LLMs in Chinese medical education. These findings support the integration of LLMs in medical education and training within specialized, non-English-speaking contexts.
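A minimal sketch of the scoring logic described in the methods, i.e., per-section accuracy with the 60% pass rule, is shown below; the records are invented placeholders rather than actual exam items or model outputs.

```python
# Illustrative scoring sketch: per-section accuracy with a 60% pass threshold.
from collections import defaultdict

# each record: (section, model_answer, answer_key) -- made-up examples
records = [
    ("Basic Knowledge", "A", "A"),
    ("Basic Knowledge", "C", "B"),
    ("Relevant Knowledge", "D", "D"),
    ("Specialized Knowledge", "B", "B"),
    ("Practice Competence", "A", "C"),
    ("Practice Competence", "E", "E"),
]

correct, total = defaultdict(int), defaultdict(int)
for section, answer, key in records:
    total[section] += 1
    correct[section] += int(answer == key)

passed_all = all(correct[s] / total[s] >= 0.60 for s in total)
for section in total:
    acc = correct[section] / total[section]
    print(f"{section:22s} accuracy={acc:.1%} pass={acc >= 0.60}")
print(f"overall accuracy={sum(correct.values()) / sum(total.values()):.1%}, passed all sections: {passed_all}")
```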
Collapse
Affiliation(s)
- Weiping Wang
- Department of Radiation Oncology, Peking Union Medical Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, No. 1 Shuaifuyuan Wangfujing, Beijing, 100730, China
| | - Jingxuan Fu
- Department of Clinical Laboratory, Xuanwu Hospital, Capital Medical University, Beijing, China
| | | | - Ke Hu
- Department of Radiation Oncology, Peking Union Medical Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, No. 1 Shuaifuyuan Wangfujing, Beijing, 100730, China.
| |
Collapse
|
22
|
Dashti M, Khosraviani F, Azimi T, Hefzi D, Ghasemi S, Fahimipour A, Zare N, Khurshid Z, Habib SR. Assessing ChatGPT-4's performance on the US prosthodontic exam: impact of fine-tuning and contextual prompting vs. base knowledge, a cross-sectional study. BMC MEDICAL EDUCATION 2025; 25:761. [PMID: 40410713 PMCID: PMC12102979 DOI: 10.1186/s12909-025-07371-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/27/2025] [Accepted: 05/19/2025] [Indexed: 05/25/2025]
Abstract
BACKGROUND Artificial intelligence (AI), such as ChatGPT-4 from OpenAI, has the potential to transform medical education and assessment. However, its effectiveness in specialized fields like prosthodontics, especially when comparing base to fine-tuned models, remains underexplored. This study evaluates the performance of ChatGPT-4 on the US National Prosthodontic Resident Mock Exam in its base form and after fine-tuning. The aim is to determine whether fine-tuning improves the AI's accuracy in answering specialized questions. METHODS An official set of sample questions from the 2021 US National Prosthodontic Resident Mock Exam was used, obtained from the American College of Prosthodontists. A total of 150 questions were initially considered, and resources were available for 106 questions. Both the base and fine-tuned models of ChatGPT-4 were tested under simulated exam conditions. Performance was assessed by comparing correct and incorrect responses. The Chi-square test was used to analyze accuracy, with significance set at p < 0.05. The Kappa coefficient was calculated to measure agreement between the models' responses. RESULTS The base model of ChatGPT-4 correctly answered 62.7% of the 150 questions. For the 106 questions with resources, the fine-tuned model answered 73.6% correctly. The Chi-square test showed a significant improvement in performance after fine-tuning (p < 0.001). The Kappa coefficient was 0.39, indicating moderate agreement between the models (p < 0.001). Performance varied by topic, with lower accuracy in areas such as Implant Prosthodontics, Removable Prosthodontics, and Occlusion, though the fine-tuned model consistently outperformed the base model. CONCLUSIONS Fine-tuning ChatGPT-4 with specific resources significantly enhances its accuracy in answering specialized prosthodontic exam questions. While the base model provides a solid baseline, fine-tuning is essential for improving AI performance in specialized fields. However, certain topics may require more targeted training to achieve higher accuracy.
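The agreement analysis can be illustrated with Cohen's kappa on the two models' answer letters for the same items, as sketched below; the answer lists are invented placeholders, not the study's data.

```python
# Sketch of the agreement metric named in the abstract: raw agreement and
# Cohen's kappa between base and fine-tuned answer choices on the same items.
from sklearn.metrics import cohen_kappa_score

base       = ["A", "B", "C", "D", "A", "C", "B", "E", "D", "A"]
fine_tuned = ["A", "B", "D", "D", "A", "C", "C", "E", "D", "B"]

kappa = cohen_kappa_score(base, fine_tuned)
raw_agreement = sum(b == f for b, f in zip(base, fine_tuned)) / len(base)
print(f"raw agreement={raw_agreement:.2f}, Cohen's kappa={kappa:.2f}")
```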
Collapse
Affiliation(s)
- Mahmood Dashti
- Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tajrish, District 1, Daneshjou Blvd, Tehran, Tehran Province, Iran.
| | | | - Tara Azimi
- Orofacial Pain and Disfunction, UCLA School of Dentistry, Los Angeles, CA, USA
| | - Delband Hefzi
- School of Dentistry, Tehran University of Medical Science, Tehran, Iran
| | - Shohreh Ghasemi
- Trauma and Craniofacial Reconstruction, Queen Mary College, London, England
| | - Amir Fahimipour
- Discipline of Oral Surgery, Medicine and Diagnostics, Faculty of Medicine and Health, Westmead Hospital, The University of Sydney, Sydney, NSW, 2145, Australia
| | - Niusha Zare
- Department of Operative Dentistry, University of Southern California, Los Angeles, USA
| | - Zohaib Khurshid
- Center of Excellence for Regenerative Dentistry, Department of Anatomy, Faculty of Dentistry, Chulalongkorn University, Bangkok, 10330, Thailand
| | - Syed Rashid Habib
- Department of Prosthetic Dental Sciences, College of Dentistry, King Saud University, P. O. Box 60169, King Abdullah Road, 11545, Riyadh, Saudi Arabia.
| |
Collapse
|
23
|
Dastani M, Mardaneh J, Rostamian M. Large language models' capabilities in responding to tuberculosis medical questions: testing ChatGPT, Gemini, and Copilot. Sci Rep 2025; 15:18004. [PMID: 40410343 PMCID: PMC12102205 DOI: 10.1038/s41598-025-03074-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2025] [Accepted: 05/19/2025] [Indexed: 05/25/2025] Open
Abstract
This study aims to evaluate the capability of Large Language Models (LLMs) in responding to questions related to tuberculosis. Three large language models (ChatGPT, Gemini, and Copilot) were selected based on public accessibility criteria and their ability to respond to medical questions. Questions were designed across four main domains (diagnosis, treatment, prevention and control, and disease management). The responses were subsequently evaluated using DISCERN-AI and NLAT-AI assessment tools. ChatGPT achieved higher scores (4 out of 5) across all domains, while Gemini demonstrated superior performance in specific areas such as prevention and control with a score of 4.4. Copilot showed the weakest performance in disease management with a score of 3.6. In the diagnosis domain, all three models demonstrated equivalent performance (4 out of 5). According to the DISCERN-AI criteria, ChatGPT excelled in information relevance but showed deficiencies in providing sources and information production dates. All three models exhibited similar performance in balance and objectivity indicators. While all three models demonstrate acceptable capabilities in responding to medical questions related to tuberculosis, they share common limitations such as insufficient source citation and failure to acknowledge response uncertainties. Enhancement of these models could strengthen their role in providing medical information.
Collapse
Affiliation(s)
- Meisam Dastani
- Infectious Diseases Research Center, Gonabad University of Medical Sciences, Gonabad, Iran
| | - Jalal Mardaneh
- Department of Microbiology, Infectious Diseases Research Center, School of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran
| | - Morteza Rostamian
- English Department, School of Medicine, Gonabad University of Medical Sciences, Gonabad, Iran.
| |
Collapse
|
24
|
Hong Q, Liu S, Wu L, Lu Q, Yang P, Chen D, Rao G, Liu X, Ye H, Zhuang P, Yang W, Zeng S, Feng Q, Liu X, Cai J, Cheng S. Evaluating the performance of large language & visual-language models in cervical cytology screening. NPJ Precis Oncol 2025; 9:153. [PMID: 40410424 PMCID: PMC12102327 DOI: 10.1038/s41698-025-00916-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2025] [Accepted: 04/19/2025] [Indexed: 05/25/2025] Open
Abstract
Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and an average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. In addition, LLMs and LVLMs showed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.
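A stripped-down sketch of a semi-automatic grading loop of the kind described above: each open-ended answer is scored 0-10 against a reference by a pluggable judge. The keyword-overlap judge here is only an offline stand-in so the example runs without a model; the study used a GPT-based judge plus expert review, which this does not reproduce.

```python
# Offline grading-loop sketch with a pluggable judge function.
import re
from typing import Callable, List, Tuple

def overlap_judge(answer: str, reference: str) -> float:
    # crude stand-in judge: fraction of reference terms present in the answer, scaled to 0-10
    ref = set(re.findall(r"[a-z]+", reference.lower()))
    ans = set(re.findall(r"[a-z]+", answer.lower()))
    return 10.0 * len(ref & ans) / max(1, len(ref))

def grade(qa_pairs: List[Tuple[str, str, str]], judge: Callable[[str, str], float]) -> float:
    # qa_pairs: (question, model_answer, reference_answer)
    scores = [judge(answer, reference) for _, answer, reference in qa_pairs]
    return sum(scores) / len(scores)

qa_pairs = [
    ("What does ASC-US stand for?",
     "ASC-US means atypical squamous cells of undetermined significance.",
     "Atypical squamous cells of undetermined significance"),
]
print(f"mean judge score: {grade(qa_pairs, overlap_judge):.1f}/10")
```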
Collapse
Affiliation(s)
- Qi Hong
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
| | - Shijie Liu
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
| | - Liying Wu
- Department of Obstetrics and Gynecology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Qiqi Lu
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
| | - Pinglan Yang
- Department of Obstetrics and Gynecology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Dingyu Chen
- Department of Obstetrics and Gynecology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Gong Rao
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
| | - Xinyi Liu
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
| | - Hua Ye
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
| | - Peiqi Zhuang
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
| | - Wenxiu Yang
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
| | - Shaoqun Zeng
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
| | - Qianjin Feng
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China
- Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou, 510515, China
- Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou, 510515, China
| | - Xiuli Liu
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China.
| | - Jing Cai
- Department of Obstetrics and Gynecology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
| | - Shenghua Cheng
- School of Biomedical Engineering, Southern Medical University, Guangzhou, 510515, China.
- Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou, 510515, China.
- Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou, 510515, China.
| |
Collapse
|
25
|
Lee JK, Park S, Hwang SH, Lee J, Cho D, Choi S. Comparative evaluation of six large language models in transfusion medicine: Addressing language and domain-specific challenges. Vox Sang 2025. [PMID: 40410122 DOI: 10.1111/vox.70050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2025] [Revised: 04/28/2025] [Accepted: 05/06/2025] [Indexed: 05/25/2025]
Abstract
BACKGROUND AND OBJECTIVES Large language models (LLMs) such as GPT-4 are increasingly utilized in clinical and educational settings; however, their validity in subspecialized domains like transfusion medicine remains insufficiently characterized. This study assessed the performance of six LLMs on transfusion-related questions from Korean national licensing examinations for medical doctors (MDs) and medical technologists (MTs). MATERIALS AND METHODS A total of 23 MD and 67 MT questions (2020-2023) were extracted from publicly available sources. All items were originally written in Korean and subsequently translated into English to evaluate cross-linguistic performance. Each model received standardized multiple-choice prompts (five options), and correctness was determined by explicit answer selection. Accuracy was calculated as the proportion of correct responses, with 0.75 designated as the performance threshold. Chi-square tests were employed to analyse language-based differences. RESULTS GPT-4 and GPT-4o consistently surpassed the 0.75 threshold across both languages and examination types. GPT-3.5 demonstrated reasonable accuracy in English but showed a marked decline in Korean, suggesting limitations in multilingual generalization. Gemini 1.5 outperformed Gemini 1, particularly in Korean, though both exhibited variability across technical subdomains. Clova X showed inconsistent results across settings. All models demonstrated limited performance in legal and ethical scenarios. CONCLUSION GPT-4 and GPT-4o exhibited robust and reliable performance across a range of transfusion medicine topics. Nonetheless, inter-model and inter-language variability highlights the need for targeted fine-tuning, particularly in the context of local regulatory and ethical frameworks, to support safe and context-appropriate implementation in clinical practice.
Collapse
Affiliation(s)
- Jong Kwon Lee
- Department of Laboratory Medicine and Genetics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
| | - Sholhui Park
- Department of Laboratory Medicine, Ewha Womans University, College of Medicine, Seoul, Republic of Korea
| | - Sang-Hyun Hwang
- Department of Laboratory Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
| | - Jaejoon Lee
- Department of Laboratory Medicine and Genetics, Soonchunhyang University Bucheon Hospital, Bucheon, Republic of Korea
| | - Duck Cho
- Department of Laboratory Medicine and Genetics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
| | - Sooin Choi
- Department of Laboratory Medicine and Genetics, Soonchunhyang University Bucheon Hospital, Bucheon, Republic of Korea
| |
Collapse
|
26
|
Edwards CJ, Cornelison B, Erstad BL. Comparison of a generative large language model to pharmacy student performance on therapeutics examinations. CURRENTS IN PHARMACY TEACHING & LEARNING 2025; 17:102394. [PMID: 40409210 DOI: 10.1016/j.cptl.2025.102394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2024] [Revised: 03/17/2025] [Accepted: 05/09/2025] [Indexed: 05/25/2025]
Abstract
OBJECTIVE To compare the performance of a generative language model (ChatGPT-3.5) to pharmacy students on therapeutics examinations. METHODS Questions were drawn from two pharmacotherapeutics courses in a 4-year PharmD program. Questions were classified as case based or non-case based and application or recall. Questions were entered into ChatGPT version 3.5 and responses were scored. ChatGPT's score for each exam was calculated by dividing the number of correct responses by the total number of questions. The mean composite score for ChatGPT was calculated by adding individual scores from each exam and dividing by the number of exams. The mean composite score for the students was calculated by dividing the sum of the mean class performance on each exam by the number of exams. Chi-square was used to identify factors associated with incorrect responses from ChatGPT. RESULTS The mean composite score across 6 exams for ChatGPT was 53 (SD = 19.2) compared to 82 (SD = 4) for the pharmacy students (p = 0.0048). ChatGPT answered 51% of questions correctly. ChatGPT was less likely to answer application-based questions correctly compared to recall-based questions (44% vs 80%) and less likely to answer case-based questions correctly compared to non-case-based questions (45% vs 74%). CONCLUSION ChatGPT scored lower than the average grade for pharmacy students and was less likely to answer application-based and case-based questions correctly. These findings provide valuable insight into how this technology performs, which can help inform best practices for item development and highlights the limitations of this technology.
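The composite-score arithmetic in the methods can be made concrete in a few lines of code; the per-exam numbers below are invented placeholders, not the course data.

```python
# Worked example of the composite-score calculation: per-exam accuracy for
# ChatGPT, then a mean across exams, alongside the mean of the class averages.
chatgpt_correct = [20, 31, 18, 27, 22, 25]     # correct answers per exam (placeholder)
exam_lengths    = [40, 50, 38, 45, 42, 47]     # questions per exam (placeholder)
class_means     = [84, 80, 79, 85, 82, 81]     # mean class score (%) per exam (placeholder)

chatgpt_scores = [100 * c / n for c, n in zip(chatgpt_correct, exam_lengths)]
chatgpt_composite = sum(chatgpt_scores) / len(chatgpt_scores)
student_composite = sum(class_means) / len(class_means)

print(f"ChatGPT composite: {chatgpt_composite:.1f}")
print(f"Student composite: {student_composite:.1f}")
```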
Collapse
Affiliation(s)
- Christopher J Edwards
- Department of Pharmacy Practice & Science, University of Arizona R. Ken Coit College of Pharmacy, Tucson, AZ, United States of America.
| | - Bernadette Cornelison
- Department of Pharmacy Practice & Science, University of Arizona R. Ken Coit College of Pharmacy, Tucson, AZ, United States of America.
| | - Brian L Erstad
- Department of Pharmacy Practice & Science, University of Arizona R. Ken Coit College of Pharmacy, Tucson, AZ, United States of America.
| |
Collapse
|
27
|
Mutalik P, Cheung KH, Green J, Buelt-Gebhardt M, Anderson KF, Jeanpaul V, McDonald L, Wininger M, Li Y, Rajeevan N, Jessel PM, Moore H, Adabag S, Raitt MH, Aslan M. Combining Rule-based NLP-lite with Rapid Iterative Chart Adjudication for Creation of a Large, Accurately Curated Cohort from EHR data: A Case Study in the Context of a Clinical Trial Emulation. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:847-856. [PMID: 40417550 PMCID: PMC12099393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 05/27/2025]
Abstract
The aim of this work was to create a gold-standard curated cohort of 10,000+ cases from the Veterans Affairs (VA) corporate data warehouse (CDW) for virtual emulation of a randomized clinical trial (CSP#592). The trial had six inclusion/exclusion criteria lacking adequate structured data. We therefore used a hybrid computer/human approach to extract information from clinical notes. Rule-based NLP output was iteratively adjudicated by a panel of trained non-clinician content experts and non-experts using an easy-to-use spreadsheet-based rapid adjudication display. This group-adjudication process iteratively sharpened both the computer algorithm and clinical decision criteria, while simultaneously training the non-experts. The cohort was successfully created with each inclusion/exclusion decision backed by a source document. Less than 0.5% of cases required referral to specialist clinicians. It is likely that such curated datasets capturing specialist reasoning and using a process-supervised approach will acquire greater importance as training tools for future clinical AI applications.
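A minimal sketch of the "NLP-lite" pattern described above: simple regex rules flag candidate mentions of trial criteria in note text and write them to a CSV queue for human adjudication. The two rules and the notes are invented and far simpler than the study's actual logic.

```python
# Toy rule-based flagging step that feeds a human adjudication queue.
import csv
import re

# rule name -> compiled pattern; both rules are invented examples
RULES = {
    "ventricular_tachycardia": re.compile(r"\bventricular tachycardia\b|\bVT\b", re.IGNORECASE),
    "ejection_fraction_pct": re.compile(r"\bejection fraction\D{0,10}(\d{1,2})\s*%", re.IGNORECASE),
}

notes = [
    ("pt001", "Echo today: ejection fraction 30%. No VT on telemetry."),
    ("pt002", "History of sustained ventricular tachycardia in 2019."),
]

with open("adjudication_queue.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["patient", "rule", "snippet"])
    for pid, text in notes:
        for rule, pattern in RULES.items():
            match = pattern.search(text)
            if match:
                snippet = text[max(0, match.start() - 20):match.end() + 20]
                writer.writerow([pid, rule, snippet])
```

Negated or ambiguous hits, such as "No VT" in the first note, are exactly the kind of output the spreadsheet-based human adjudication step is meant to resolve.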
Collapse
Affiliation(s)
- Pradeep Mutalik
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
- Yale University School of Medicine, New Haven, CT
| | - Kei-Hoi Cheung
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
- Yale University School of Medicine, New Haven, CT
| | | | | | - Karen F Anderson
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
- Yale University School of Medicine, New Haven, CT
| | - Vales Jeanpaul
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
| | - Linda McDonald
- Cooperative Studies Program Coordinating Center, VA Connecticut Healthcare System, West Haven, Connecticut
| | - Michael Wininger
- Yale University School of Medicine, New Haven, CT
- Cooperative Studies Program Coordinating Center, VA Connecticut Healthcare System, West Haven, Connecticut
| | - Yuli Li
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
| | - Nallakkandi Rajeevan
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
| | - Peter M Jessel
- VA Portland Health Care System, Portland, OR
- Oregon Health and Sciences University, Portland, OR
| | - Hans Moore
- VA Washington DC Health Care, Washington, DC
- Georgetown University School of Medicine, Washington, DC
| | - Selçuk Adabag
- VA Minneapolis Health Care System, Minneapolis, MN
- University of Minnesota, Minneapolis, MN
| | - Merritt H Raitt
- VA Portland Health Care System, Portland, OR
- Oregon Health and Sciences University, Portland, OR
| | - Mihaela Aslan
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA Connecticut Healthcare System, West Haven, CT
- Yale University School of Medicine, New Haven, CT
| |
Collapse
|
28
|
Diamond CJ, Thate J, Withall JB, Lee RY, Cato K, Rossetti SC. Generative AI Demonstrated Difficulty Reasoning on Nursing Flowsheet Data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:349-358. [PMID: 40417556 PMCID: PMC12099445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 05/27/2025]
Abstract
Excessive documentation burden is linked to clinician burnout, thus motivating efforts to reduce burden. Generative artificial intelligence (AI) poses opportunities for burden reduction but requires rigorous assessment. We evaluated the ability of a large language model (LLM) (OpenAI's GPT-4) to interpret various intervention-response relationships presented on nursing flowsheets, assessing performance using MUC-5 evaluation metrics, and compared its assessments to those of nurse expert evaluators. ChatGPT correctly assessed 3 of 14 clinical scenarios, and partially correctly assessed 6 of 14, frequently omitting data from its reasoning. Nurse expert evaluators correctly assessed all relationships and provided additional language reflective of standard nursing practice beyond the intervention-response relationships evidenced in nursing flowsheets. Future work should ensure the training data used for electronic health record (EHR)-integrated LLMs includes all types of narrative nursing documentation that reflect nurses' clinical reasoning, and verification of LLM-based information summarization does not burden end-users.
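For orientation, the sketch below applies MUC-style scoring with half credit for partially correct responses, in the spirit of the MUC-5 metrics mentioned above; the category counts are invented, and the study's exact scoring conventions may differ.

```python
# MUC-style scoring sketch: correct (COR), partial (PAR), incorrect (INC),
# missing (MIS), and spurious (SPU) counts combined with half credit for partials.
def muc_scores(cor, par, inc, mis, spu):
    possible = cor + par + inc + mis       # what should have been produced
    actual = cor + par + inc + spu         # what the model actually produced
    recall = (cor + 0.5 * par) / possible if possible else 0.0
    precision = (cor + 0.5 * par) / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

r, p, f1 = muc_scores(cor=3, par=6, inc=3, mis=2, spu=1)   # placeholder counts
print(f"recall={r:.2f}, precision={p:.2f}, F1={f1:.2f}")
```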
Collapse
Affiliation(s)
| | - Jennifer Thate
- Columbia University Department of Biomedical Informatics, New York, NY
- Siena College, Loudonville, NY
| | | | - Rachel Y Lee
- Columbia University School of Nursing, New York, NY
| | - Kenrick Cato
- University of Pennsylvania School of Nursing, Philadelphia, PA
| | - Sarah C Rossetti
- Columbia University Department of Biomedical Informatics, New York, NY
- Columbia University School of Nursing, New York, NY
| |
Collapse
|
29
|
Bai X, Wang S, Zhao Y, Feng M, Ma W, Liu X. Application of AI Chatbot in Responding to Asynchronous Text-Based Messages From Patients With Cancer: Comparative Study. J Med Internet Res 2025; 27:e67462. [PMID: 40397947 DOI: 10.2196/67462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2024] [Revised: 12/22/2024] [Accepted: 04/14/2025] [Indexed: 05/23/2025] Open
Abstract
BACKGROUND Telemedicine, which incorporates artificial intelligence such as chatbots, offers significant potential for enhancing health care delivery. However, the efficacy of artificial intelligence chatbots compared to human physicians in clinical settings remains underexplored, particularly in complex scenarios involving patients with cancer and asynchronous text-based interactions. OBJECTIVE This study aimed to evaluate the performance of the GPT-4 (OpenAI) chatbot in responding to asynchronous text-based medical messages from patients with cancer by comparing its responses with those of physicians across two clinical scenarios: patient education and medical decision-making. METHODS We collected 4257 deidentified asynchronous text-based medical consultation records from 17 oncologists across China between January 1, 2020, and March 31, 2024. Each record included patient questions, demographic data, and disease-related details. The records were categorized into two scenarios: patient education (eg, symptom explanations and test interpretations) and medical decision-making (eg, treatment planning). The GPT-4 chatbot was used to simulate physician responses to these records, with each session conducted in a new conversation to avoid cross-session interference. The chatbot responses, along with the original physician responses, were evaluated by a medical review panel (3 oncologists) and a patient panel (20 patients with cancer). The medical panel assessed completeness, accuracy, and safety using a 3-level scale, whereas the patient panel rated completeness, trustworthiness, and empathy on a 5-point ordinal scale. Statistical analyses included chi-square tests for categorical variables and Wilcoxon signed-rank tests for ordinal ratings. RESULTS In the patient education scenario (n=2364), the chatbot scored higher than physicians in completeness (n=2301, 97.34% vs n=2213, 93.61% for fully complete responses; P=.002), with no significant differences in accuracy or safety (P>.05). In the medical decision-making scenario (n=1893), the chatbot exhibited lower accuracy (n=1834, 96.88% vs n=1855, 97.99% for fully accurate responses; P<.001) and trustworthiness (n=860, 50.71% vs n=1766, 93.29% rated as "Moderately trustworthy" or higher; P<.001) compared with physicians. Regarding empathy, the medical review panel rated the chatbot as demonstrating higher empathy scores across both scenarios, whereas the patient review panel reached the opposite conclusion, consistently favoring physicians in empathetic communication. Errors in chatbot responses were primarily due to misinterpretations of medical terminology or the lack of updated guidelines, with 3.12% (59/1893) of its responses potentially leading to adverse outcomes, compared with 2.01% (38/1893) for physicians. CONCLUSIONS The GPT-4 chatbot performs comparably to physicians in patient education by providing comprehensive and empathetic responses. However, its reliability in medical decision-making remains limited, particularly in complex scenarios requiring nuanced clinical judgment. These findings underscore the chatbot's potential as a supplementary tool in telemedicine while highlighting the need for physician oversight to ensure patient safety and accuracy.
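The paired ordinal comparison described in the methods can be illustrated with a Wilcoxon signed-rank test on ratings of chatbot versus physician replies to the same messages; the ratings below are invented placeholders, not the study's data.

```python
# Paired ordinal comparison sketch: Wilcoxon signed-rank on 1-5 ratings of
# chatbot vs physician responses to the same consultation messages.
from scipy.stats import wilcoxon

chatbot_ratings   = [3, 4, 2, 3, 4, 3, 2, 5, 3, 4]
physician_ratings = [4, 4, 4, 5, 4, 3, 4, 5, 4, 5]

stat, p = wilcoxon(chatbot_ratings, physician_ratings)
print(f"Wilcoxon statistic={stat:.1f}, p={p:.3f}")
```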
Collapse
Affiliation(s)
- Xuexue Bai
- Department of Neurosurgery, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Shiyong Wang
- Department of Neurosurgery, First Affiliated Hospital of Jinan University, Guangzhou, China
| | - Yuanli Zhao
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Ming Feng
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Wenbin Ma
- Department of Neurosurgery, Peking Union Medical College Hospital, Beijing, China
| | - Xiaomin Liu
- Head and Neck Neuro-Oncology Center, Tianjin Huanhu Hospital, Tianjin, China
| |
Collapse
|
30
|
Scholich T, Barr M, Wiltsey Stirman S, Raj S. A Comparison of Responses from Human Therapists and Large Language Model-Based Chatbots to Assess Therapeutic Communication: Mixed Methods Study. JMIR Ment Health 2025; 12:e69709. [PMID: 40397927 DOI: 10.2196/69709] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/05/2024] [Revised: 03/25/2025] [Accepted: 03/30/2025] [Indexed: 05/23/2025] Open
Abstract
BACKGROUND Consumers are increasingly using large language model-based chatbots to seek mental health advice or intervention due to ease of access and limited availability of mental health professionals. However, their suitability and safety for mental health applications remain underexplored, particularly in comparison to professional therapeutic practices. OBJECTIVE This study aimed to evaluate how general-purpose chatbots respond to mental health scenarios and compare their responses to those provided by licensed therapists. Specifically, we sought to identify chatbots' strengths and limitations, as well as the ethical and practical considerations necessary for their use in mental health care. METHODS We conducted a mixed methods study to compare responses from chatbots and licensed therapists to scripted mental health scenarios. We created 2 fictional scenarios and prompted 3 chatbots to create 6 interaction logs. We recruited 17 therapists and conducted study sessions that consisted of 3 activities. First, therapists responded to the 2 scenarios using a Qualtrics form. Second, therapists went through the 6 interaction logs using a think-aloud procedure to highlight their thoughts about the chatbots' responses. Finally, we conducted a semistructured interview to explore subjective opinions on the use of chatbots for supporting mental health. The study sessions were analyzed using thematic analysis. The interaction logs from chatbot and therapist responses were coded using the Multitheoretical List of Therapeutic Interventions codes and then compared to each other. RESULTS We identified 7 themes describing the strengths and limitations of the chatbots as compared to therapists. These include elements of good therapy in chatbot responses, conversational style of chatbots, insufficient inquiry and feedback seeking by chatbots, chatbot interventions, client engagement, chatbots' responses to crisis situations, and considerations for chatbot-based therapy. In the use of Multitheoretical List of Therapeutic Interventions codes, we found that therapists evoked more elaboration (Mann-Whitney U=9; P=.001) and used more self-disclosure (U=45.5; P=.37) as compared to the chatbots. The chatbots used affirming (U=28; P=.045) and reassuring (U=23; P=.02) language more often than the therapists. The chatbots also used psychoeducation (U=22.5; P=.02) and suggestions (U=12.5; P=.003) more often than the therapists. CONCLUSIONS Our study demonstrates the unsuitability of general-purpose chatbots to safely engage in mental health conversations, particularly in crisis situations. While chatbots display elements of good therapy, such as validation and reassurance, overuse of directive advice without sufficient inquiry and use of generic interventions make them unsuitable as therapeutic agents. Careful research and evaluation will be necessary to determine the impact of chatbot interactions and to identify the most appropriate use cases related to mental health.
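As an illustration of the code-count comparison in the methods, the sketch below runs a Mann-Whitney U test on per-transcript counts of one therapeutic-intervention code for therapists versus chatbots; the counts are invented placeholders.

```python
# Group comparison sketch: Mann-Whitney U on per-transcript counts of a single
# intervention code (e.g., "evoking elaboration") for therapists vs chatbots.
from scipy.stats import mannwhitneyu

therapist_counts = [5, 7, 4, 6, 8, 5, 6]
chatbot_counts   = [2, 1, 3, 2, 1, 2]

u, p = mannwhitneyu(therapist_counts, chatbot_counts, alternative="two-sided")
print(f"U={u:.1f}, p={p:.4f}")
```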
Collapse
Affiliation(s)
- Till Scholich
- Institute for Human-Centered AI, Stanford University, Stanford, CA, United States
| | - Maya Barr
- PGSP-Stanford PsyD Consortium, Palo Alto University, Palo Alto, CA, United States
| | - Shannon Wiltsey Stirman
- Dissemination and Training Division, National Center for PTSD, Menlo Park, CA, United States
- Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, United States
| | - Shriti Raj
- Institute for Human-Centered AI, Stanford University, Stanford, CA, United States
- Department of Medicine Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States
| |
Collapse
|
31
|
Fushimi A, Terada M, Tahara R, Nakazawa Y, Iwase M, Shibayama T, Kotti S, Yamashita N, Iesato A. Assessing the quality of Japanese online breast cancer treatment information using large language models: a comparison of ChatGPT, Claude, and expert evaluations. Breast Cancer 2025:10.1007/s12282-025-01719-1. [PMID: 40399592 DOI: 10.1007/s12282-025-01719-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2024] [Accepted: 05/03/2025] [Indexed: 05/23/2025]
Abstract
BACKGROUND The internet is a primary source of health information for breast cancer patients, but online content quality varies widely. This study aimed to evaluate the capability of large language models (LLMs), including ChatGPT and Claude, to assess the quality of online Japanese breast cancer treatment information by calculating and comparing their DISCERN scores with those of expert raters. METHODS We analyzed 60 Japanese web pages on breast cancer treatments (surgery, chemotherapy, immunotherapy) using the DISCERN instrument. Each page was evaluated by the LLMs ChatGPT and Claude, along with two expert raters. We assessed the LLMs' evaluation consistency, correlations between LLM and expert assessments, and relationships between DISCERN scores, Google search rankings, and content length. RESULTS Evaluations by LLMs showed high consistency and moderate to strong correlations with expert assessments (ChatGPT vs Expert: r = 0.65; Claude vs Expert: r = 0.68). LLMs assigned slightly higher scores than expert raters. Chemotherapy pages received the highest quality scores, followed by surgery and immunotherapy. We found a weak negative correlation between Google search ranking and DISCERN scores, and a moderate positive correlation (r = 0.45) between content length and quality ratings. CONCLUSIONS This study demonstrates the potential of LLM-assisted evaluation in assessing online health information quality, while highlighting the importance of human expertise. LLMs could efficiently process large volumes of health information but should complement human insight for comprehensive assessments. These findings have implications for improving the accessibility and reliability of breast cancer treatment information.
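The correlation analysis reported above reduces to a Pearson correlation between LLM-assigned and expert-assigned DISCERN totals across pages, as sketched below with invented scores (DISCERN totals range from 16 to 80).

```python
# Agreement sketch: Pearson correlation between LLM and expert DISCERN totals.
from scipy.stats import pearsonr

llm_scores    = [52, 61, 45, 58, 40, 66, 49, 55]   # placeholder page-level totals
expert_scores = [48, 58, 41, 55, 42, 60, 44, 50]

r, p = pearsonr(llm_scores, expert_scores)
print(f"r={r:.2f}, p={p:.3f}")
```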
Collapse
Affiliation(s)
- Atsushi Fushimi
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan.
- Department of Surgery, The Jikei University School of Medicine, 3-25-8 Nishi-Shimbashi, Minato-ku, Tokyo, 105-8461, Japan.
| | - Mitsuo Terada
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- Department of Breast Surgery, Nagoya City University Graduate School of Medical Sciences, 1-Kawasumi, Mizuho-cho, Mizuho-ku, Nagoya, Aichi, 467-8602, Japan
| | - Rie Tahara
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
| | - Yuko Nakazawa
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- Department of General Surgical Science, Graduate School of Medicine, Gunma University, 3-39-22 Showa-machi, Maebashi City, Gunma, 371-8511, Japan
| | - Madoka Iwase
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- Department of Breast and Endocrine Surgery, Nagoya University Hospital, 65 Tsurumai-cho, Showa-ku, Nagoya, Aichi, 466-8560, Japan
| | - Tomoko Shibayama
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- Machida Ekimae Breast Clinic, 6-21-32 Haramachida, Machida, Tokyo, 194-0013, Japan
| | - Samy Kotti
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- Division of Hematology-Oncology, University of Pittsburgh Medical Center (UPMC) Hillman Cancer Center, 5115 Centre Avenue, Pittsburgh, PA, 15232, USA
| | - Nami Yamashita
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- Breast Oncology Center, Breast Surgical Oncology, Cancer Institute Hospital of Japanese Foundation for Cancer Research, 3-8-31, Ariake, Koto-ku, Tokyo, 135-8550, Japan
| | - Asumi Iesato
- General Incorporated Association, BC TUBE, 1-5-6 Kudan-minami, Chiyoda-ku, Tokyo, 102-0074, Japan
- NEXT-Ganken program, Japanese Foundation for Cancer Research, 3-8-31, Ariake, Koto-ku, Tokyo, 135-8550, Japan
| |
Collapse
|
32
|
Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res 2025; 27:e69910. [PMID: 40392576 DOI: 10.2196/69910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2024] [Revised: 03/16/2025] [Accepted: 04/28/2025] [Indexed: 05/22/2025] Open
Abstract
BACKGROUND Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chat-bot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate) are crucial for the safe and effective integration of these tools into mental health settings. OBJECTIVE This study aimed to assess the accuracy, reliability, and predictors of accuracy of GPT-3.5 (175 billion parameters), GPT-4 (approximately 1.8 trillion parameters), and GPT-4o (an optimized version of GPT-4 with unknown parameters) with standardized psychiatry multiple-choice questions (MCQs). METHODS A cross-sectional study was conducted where 3 commonly available, commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to provide answers to single-answer MCQs (N=150) extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ 10 times. We evaluated the accuracy and reliability of the answers and sought predictors of answer accuracy. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (1) response consistency to MCQs across 10 trials (reliability), (2) the correlation between MCQ answer accuracy and response consistency, and (3) the correlation between MCQ answer accuracy and model self-reported confidence. RESULTS On the first attempt, GPT-3.5 answered 58.0% (87/150) of MCQs correctly, while GPT-4 and GPT-4o answered 84.0% (126/150) and 87.3% (131/150) correctly, respectively. GPT-4 and GPT-4o showed no difference in performance (P=.51), but they significantly outperformed GPT-3.5 (P<.001). GPT-3.5 exhibited less response consistency on average compared to the other models (P<.001). MCQ response consistency was positively correlated with MCQ accuracy across all models (r=0.340, 0.682, and 0.590 for GPT-3.5, GPT-4, and GPT-4o, respectively; all P<.001), whereas model self-reported confidence showed no correlation with accuracy in the models, except for GPT-3.5, where self-reported confidence was weakly inversely correlated with accuracy (P<.001). CONCLUSIONS To our knowledge, this is the first comprehensive evaluation of the general psychiatric knowledge encoded in commercially available LLMs and the first study to assess their reliability and identify predictors of response accuracy within medical domains. The findings suggest that GPT-4 and GPT-4o encode accurate and reliable general psychiatric knowledge and that methods, such as repeated prompting, may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.
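The consistency measure described in the methods can be sketched as follows: each MCQ is answered 10 times, consistency is the share of trials agreeing with the modal answer, and consistency is then correlated with first-attempt correctness. The trial data below are invented placeholders, not the study's outputs.

```python
# Repeated-prompting consistency sketch: modal-answer agreement per question,
# correlated with first-attempt accuracy (Pearson on a binary outcome, i.e.,
# a point-biserial correlation).
from collections import Counter
from scipy.stats import pearsonr

trials = {   # question id -> (answer key, 10 sampled answers)
    "q1": ("B", list("BBBBBBBBBB")),
    "q2": ("C", list("CCACCCDCCC")),
    "q3": ("A", list("DADBDADDDA")),
    "q4": ("E", list("EEEEEEEEEC")),
}

consistency, correct = [], []
for key, answers in trials.values():
    mode, count = Counter(answers).most_common(1)[0]
    consistency.append(count / len(answers))
    correct.append(int(answers[0] == key))      # first-attempt correctness

r, p = pearsonr(consistency, correct)
print(f"point-biserial r={r:.2f}, p={p:.3f}")
```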
Collapse
Affiliation(s)
- Kaitlin Hanss
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Karthik V Sarma
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Anne L Glowinski
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Andrew Krystal
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Ramotse Saunders
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Andrew Halls
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Sasha Gorrell
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| | - Erin Reilly
- Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
| |
Collapse
|
33
|
Cross J, Kayalackakom T, Robinson RE, Vaughans A, Sebastian R, Hood R, Lewis C, Devaraju S, Honnavar P, Naik S, Joseph J, Anand N, Mohammed A, Johnson A, Cohen E, Adeniji T, Nnenna Nnaji A, George JE. Assessing ChatGPT's Capability as a New Age Standardized Patient: Qualitative Study. JMIR MEDICAL EDUCATION 2025; 11:e63353. [PMID: 40393017 DOI: 10.2196/63353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Revised: 03/18/2025] [Accepted: 03/18/2025] [Indexed: 05/22/2025]
Abstract
Background Standardized patients (SPs) have been crucial in medical education, offering realistic patient interactions to students. Despite their benefits, SP training is resource-intensive and access can be limited. Advances in artificial intelligence (AI), particularly with large language models such as ChatGPT, present new opportunities for virtual SPs, potentially addressing these limitations. Objectives This study aims to assess medical students' perceptions and experiences of using ChatGPT as an SP and to evaluate ChatGPT's effectiveness in performing as a virtual SP in a medical school setting. Methods This qualitative study, approved by the American University of Antigua Institutional Review Board, involved 9 students (5 females and 4 males, aged 22-48 years) from the American University of Antigua College of Medicine. Students were observed during a live role-play, interacting with ChatGPT as an SP using a predetermined prompt. A structured 15-question survey was administered before and after the interaction. Thematic analysis was conducted on the transcribed and coded responses, with inductive category formation. Results Thematic analysis identified key themes preinteraction including technology limitations (eg, prompt engineering difficulties), learning efficacy (eg, potential for personalized learning and reduced interview stress), verisimilitude (eg, absence of visual cues), and trust (eg, concerns about AI accuracy). Postinteraction, students noted improvements in prompt engineering, some alignment issues (eg, limited responses on sensitive topics), maintained learning efficacy (eg, convenience and repetition), and continued verisimilitude challenges (eg, lack of empathy and nonverbal cues). No significant trust issues were reported postinteraction. Despite some limitations, students found ChatGPT to be a valuable supplement to traditional SPs, enhancing practice flexibility and diagnostic skills. Conclusions ChatGPT can effectively augment traditional SPs in medical education, offering accessible, flexible practice opportunities. However, it cannot fully replace human SPs due to limitations in verisimilitude and prompt engineering challenges. Integrating prompt engineering into medical curricula and continuous advancements in AI are recommended to enhance the use of virtual SPs.
Collapse
Affiliation(s)
- Joseph Cross
- Medical University of the Americas, PO Box 701, Charlestown, Saint Kitts and Nevis, 1 9788629500 ext 364
| | - Tarron Kayalackakom
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Raymond E Robinson
- Department of Health Informatics, School of Professional Studies, Northwestern University, Evanston, IL, United States
| | - Andrea Vaughans
- Department of Biochemistry, Cell Biology and Genetics, College of Medicine, American University of Antigua, Basseterre, Antigua and Barbuda
| | - Roopa Sebastian
- Department of Biochemistry, Cell Biology and Genetics, College of Medicine, American University of Antigua, Basseterre, Antigua and Barbuda
| | - Ricardo Hood
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Courtney Lewis
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Sumanth Devaraju
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Prasanna Honnavar
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Sheetal Naik
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Jillwin Joseph
- Department of Education Enhancement, College of Medicine, American University of Antigua, St Johns, Antigua and Barbuda
| | - Nikhilesh Anand
- Department of Medical Education, School of Medicine, University of Texas Rio Grande Valley, Edinburgh, TX, United States
| | | | - Asjah Johnson
- School of Medicine, Xavier University, Orangestad, Aruba
| | - Eliran Cohen
- School of Medicine, Xavier University, Orangestad, Aruba
| | | | | | | |
Collapse
|
34
|
Mirza A, Alampara N, Kunchapu S, Ríos-García M, Emoekabu B, Krishnan A, Gupta T, Schilling-Wilhelmi M, Okereke M, Aneesh A, Asgari M, Eberhardt J, Elahi AM, Elbeheiry HM, Gil MV, Glaubitz C, Greiner M, Holick CT, Hoffmann T, Ibrahim A, Klepsch LC, Köster Y, Kreth FA, Meyer J, Miret S, Peschel JM, Ringleb M, Roesner NC, Schreiber J, Schubert US, Stafast LM, Wonanke ADD, Pieler M, Schwaller P, Jablonka KM. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat Chem 2025:10.1038/s41557-025-01815-x. [PMID: 40394186 DOI: 10.1038/s41557-025-01815-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2024] [Accepted: 03/26/2025] [Indexed: 05/22/2025]
Abstract
Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs' impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
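A toy harness in the spirit of the benchmarking framework described above: questions go to a model function, an answer letter is extracted from the free-text reply, and accuracy is tallied. The stub model, regex extraction, and questions are placeholders; the actual framework curates over 2,700 expert-written items and is considerably more elaborate.

```python
# Minimal MCQ-benchmark harness sketch with a stubbed model call.
import re

questions = [
    {"q": "Which element has the symbol Na? (A) Nitrogen (B) Sodium (C) Neon (D) Nickel", "key": "B"},
    {"q": "What is the conjugate base of NH4+? (A) NH3 (B) NH2- (C) N2 (D) NO3-", "key": "A"},
]

def stub_model(prompt):
    # stand-in for a real LLM call; always returns the same canned reply
    return "The symbol Na denotes sodium, so the answer is (B)."

def extract_choice(reply):
    # pull a single answer letter out of free text; real parsers are more robust
    match = re.search(r"answer is \(?([A-D])\)?", reply) or re.search(r"\(([A-D])\)\s*\.?\s*$", reply.strip())
    return match.group(1) if match else None

n_correct = sum(extract_choice(stub_model(item["q"])) == item["key"] for item in questions)
print(f"accuracy: {n_correct}/{len(questions)}")
```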
Collapse
Affiliation(s)
- Adrian Mirza
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Jena, Germany
| | - Nawaf Alampara
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Sreekanth Kunchapu
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Martiño Ríos-García
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Institute of Carbon Science and Technology, CSIC, Oviedo, Spain
| | - Benedict Emoekabu
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | | | - Tanya Gupta
- Laboratory of Artificial Chemical Intelligence, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- National Centre of Competence in Research Catalysis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Mara Schilling-Wilhelmi
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Macjonathan Okereke
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Anagha Aneesh
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Mehrdad Asgari
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
| | | | - Amir Mohammad Elahi
- Laboratory of Molecular Simulation, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne, Sion, Switzerland
| | - Hani M Elbeheiry
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | | | | | - Maximilian Greiner
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Caroline T Holick
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Tim Hoffmann
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Abdelrahman Ibrahim
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Lea C Klepsch
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Yannik Köster
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Fabian Alexander Kreth
- Institute for Technical Chemistry and Environmental Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Jena, Germany
| | - Jakob Meyer
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | | | - Jan Matthias Peschel
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
| | - Michael Ringleb
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Nicole C Roesner
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Johanna Schreiber
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - Ulrich S Schubert
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Jena, Germany
| | - Leanne M Stafast
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
| | - A D Dinga Wonanke
- Theoretical Chemistry, Technische Universität Dresden, Dresden, Germany
| | | | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- National Centre of Competence in Research Catalysis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Kevin Maik Jablonka
- Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany.
- Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Jena, Germany.
- Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany.
- Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Jena, Germany.
| |
Collapse
|
35
|
Binaljadm TM, Alqutaibi AY, Halboub E, Zafar MS, Saker S. Artificial Intelligence Chatbots as Sources of Implant Dentistry Information for the Public: Validity and Reliability Assessment. Eur J Dent 2025. [PMID: 40393663 DOI: 10.1055/s-0045-1809155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/22/2025] Open
Abstract
This study assessed the reliability and validity of responses from three chatbot systems (OpenAI's GPT-3.5, Gemini, and Copilot) to frequently asked questions (FAQs) in implant dentistry posed by patients. Twenty FAQs were prompted to the three chatbots at three different time points through their respective application programming interfaces. The responses were assessed for validity (low and high threshold) and reliability by two prosthodontic consultants using a five-point Likert scale. Normality was tested with the Shapiro-Wilk test. Differences between chatbots at a given (fixed) time point, and within the same chatbot across time points, were assessed using Friedman's two-way analysis of variance by ranks, followed by pairwise comparisons. All statistical analyses were conducted in SPSS (Statistical Package for the Social Sciences) Version 26.0. GPT-3.5 provided the longest responses, while Gemini was the most concise. All chatbots frequently advised consulting dental professionals. Validity was high under the low-threshold test but low under the high-threshold test, with Copilot scoring the highest. Reliability was high for all, with Gemini achieving perfect consistency. The chatbots showed consistent and generally valid responses, with some variability in accuracy and detail. While they demonstrated a high degree of reliability, their validity, especially under the high-threshold criterion, remains limited. Improvements in accuracy and comprehensiveness are necessary for more effective use in providing information about dental implants.
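As a rough illustration of the statistics described in this abstract, the sketch below runs a Shapiro-Wilk normality check and Friedman's two-way analysis of variance by ranks on hypothetical five-point Likert ratings; the ratings, sample sizes, and group labels are invented for illustration and are not data from the study.

```python
# Minimal sketch (not the authors' code): Shapiro-Wilk normality check and
# Friedman's test on hypothetical Likert ratings of three chatbots over 20 FAQs.
import numpy as np
from scipy.stats import shapiro, friedmanchisquare

rng = np.random.default_rng(0)
# Hypothetical ratings: rows = 20 FAQs, columns = GPT-3.5, Gemini, Copilot
ratings = rng.integers(3, 6, size=(20, 3))

# Normality check per chatbot (Likert data are typically non-normal,
# which motivates the rank-based Friedman test)
for name, col in zip(["GPT-3.5", "Gemini", "Copilot"], ratings.T):
    stat, p = shapiro(col)
    print(f"Shapiro-Wilk {name}: W={stat:.3f}, p={p:.3f}")

# Friedman's two-way ANOVA by ranks across the three related samples
chi2, p = friedmanchisquare(ratings[:, 0], ratings[:, 1], ratings[:, 2])
print(f"Friedman chi-square={chi2:.3f}, p={p:.3f}")
```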
Collapse
Affiliation(s)
- Tahani Mohammed Binaljadm
- Department of Substitutive Dental Sciences (Prosthodontics), College of Dentistry, Taibah University, Al Madinah, Saudi Arabia
| | - Ahmed Yaseen Alqutaibi
- Department of Substitutive Dental Sciences (Prosthodontics), College of Dentistry, Taibah University, Al Madinah, Saudi Arabia
- Department of Prosthodontics, College of Dentistry, Ibb University, Ibb, Yemen
| | - Esam Halboub
- Department of Maxillofacial Surgery and Diagnostic Science, College of Dentistry, Jazan University, Jazan, Saudi Arabia
| | - Muhammad Sohail Zafar
- Department of Clinical Sciences, College of Dentistry, Ajman University, Ajman, United Arab Emirates
- Centre of Medical and Bio-allied Health Sciences Research, Ajman University, Ajman, United Arab Emirates
- School of Dentistry, Jordan University, Amman, Jordan
| | - Samah Saker
- Department of Substitutive Dental Sciences (Prosthodontics), College of Dentistry, Taibah University, Al Madinah, Saudi Arabia
| |
Collapse
|
36
|
Runyon C. Using large language models (LLMs) to apply analytic rubrics to score post-encounter notes. MEDICAL TEACHER 2025:1-9. [PMID: 40380943 DOI: 10.1080/0142159x.2025.2504106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/22/2024] [Accepted: 05/06/2025] [Indexed: 05/19/2025]
Abstract
BACKGROUND Large language models (LLMs) show promise in medical education. This study examines LLMs' ability to score post-encounter notes (PNs) from Objective Structured Clinical Examinations (OSCEs) using an analytic rubric. The goal was to evaluate and refine methods for accurate, consistent scoring. METHODS Seven LLMs scored five PNs representing varying levels of performance, including an intentionally incorrect PN. An iterative experimental design tested different prompting strategies and temperature settings, a parameter controlling LLM response creativity. Scores were compared to expected rubric-based results. RESULTS Consistently accurate scoring required multiple rounds of prompt refinement. Simple prompting led to high variability, which improved with structured approaches and low-temperature settings. LLMs occasionally made errors calculating total scores, necessitating external calculation. The final approach yielded consistently accurate scores across all models. CONCLUSIONS LLMs can reliably apply analytic rubrics to PNs with careful prompt engineering and process refinement. This study illustrates their potential as scalable, automated scoring tools in medical education, though further research is needed to explore their use with holistic rubrics. These findings demonstrate the utility of LLMs in assessment practices.
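The scoring setup described above (a structured prompt, a low temperature setting, and totals computed outside the model) can be sketched roughly as follows. The model name, rubric items, note text, and JSON output format are illustrative assumptions, not materials from the study.

```python
# Hedged sketch of rubric-based scoring with a low-temperature LLM call.
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the model name and rubric items are placeholders, not the study's materials.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = ["Documents chief complaint", "Lists pertinent positives", "States working diagnosis"]
NOTE = "Example post-encounter note text goes here."

prompt = (
    "Score the post-encounter note against each rubric item.\n"
    'Return JSON: {"scores": [0 or 1 per item]}.\n\n'
    "Rubric:\n" + "\n".join(f"- {item}" for item in RUBRIC) +
    "\n\nNote:\n" + NOTE
)

resp = client.chat.completions.create(
    model="gpt-4o",                              # placeholder model name
    temperature=0,                               # low temperature for consistency
    response_format={"type": "json_object"},     # keep the output machine-readable
    messages=[{"role": "user", "content": prompt}],
)
item_scores = json.loads(resp.choices[0].message.content)["scores"]
total = sum(item_scores)                         # total computed externally, not by the LLM
print(item_scores, total)
```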
Collapse
|
37
|
Lin KH, Kao TH, Wang LC, Kuo CT, Chen PCH, Chu YC, Yeh YC. Benchmarking large language models GPT-4o, Llama 3.1, and Qwen 2.5 for cancer genetic variant classification. NPJ Precis Oncol 2025; 9:141. [PMID: 40369023 PMCID: PMC12078457 DOI: 10.1038/s41698-025-00935-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2025] [Accepted: 05/02/2025] [Indexed: 05/16/2025] Open
Abstract
Classifying cancer genetic variants based on clinical actionability is crucial yet challenging in precision oncology. Large language models (LLMs) offer potential solutions, but their performance remains underexplored. This study evaluates GPT-4o, Llama 3.1, and Qwen 2.5 in classifying genetic variants from the OncoKB and CIViC databases, as well as a real-world dataset derived from FoundationOne CDx reports. GPT-4o achieved the highest accuracy (0.7318) in distinguishing clinically relevant variants from variants of unknown clinical significance (VUS), outperforming Qwen 2.5 (0.5731) and Llama 3.1 (0.4976). LLMs demonstrated better concordance with expert annotations for variants with strong clinical evidence but exhibited greater inconsistencies for those with weaker evidence. All three models showed a tendency to assign variants to higher evidence levels, suggesting a propensity for overclassification. Prompt engineering significantly improved accuracy, while retrieval-augmented generation (RAG) further enhanced performance. Stability analysis across 100 iterations revealed greater consistency with the CIViC system than with OncoKB. These findings highlight the promise of LLMs in cancer genetic variant classification while underscoring the need for further optimization to improve accuracy, consistency, and clinical applicability.
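As a simplified illustration of the kind of evaluation reported above, the sketch below computes accuracy against expert labels and a crude run-to-run stability measure; the labels, predictions, and number of runs are placeholders, not data from OncoKB, CIViC, or the FoundationOne cohort.

```python
# Toy sketch: accuracy of an LLM variant classifier vs. expert labels, plus a
# simple run-to-run consistency measure. All values are invented placeholders.
from collections import Counter
from sklearn.metrics import accuracy_score

expert_labels = ["relevant", "VUS", "relevant", "VUS", "relevant"]

# Hypothetical predictions from three repeated runs on the same five variants
runs = [
    ["relevant", "VUS", "relevant", "relevant", "relevant"],
    ["relevant", "VUS", "relevant", "VUS", "relevant"],
    ["relevant", "relevant", "relevant", "VUS", "relevant"],
]

for i, preds in enumerate(runs, start=1):
    print(f"run {i} accuracy vs. experts: {accuracy_score(expert_labels, preds):.2f}")

# Crude stability measure: mean agreement of each call with the per-variant majority call
per_variant = list(zip(*runs))
majority = [Counter(calls).most_common(1)[0][0] for calls in per_variant]
agreement = sum(
    sum(call == maj for call in calls) for calls, maj in zip(per_variant, majority)
) / (len(runs) * len(majority))
print(f"mean agreement with majority call: {agreement:.2f}")
```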
Collapse
Affiliation(s)
- Kuan-Hsun Lin
- Department of Information Management, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan, ROC
| | - Tzu-Hang Kao
- Department of Pathology and Laboratory Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
| | - Lei-Chi Wang
- Department of Pathology and Laboratory Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
| | - Chen-Tsung Kuo
- Department of Information Management, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan, ROC
| | - Paul Chih-Hsueh Chen
- Department of Pathology and Laboratory Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
| | - Yuan-Chia Chu
- Department of Information Management, Taipei Veterans General Hospital, Taipei, Taiwan, ROC.
- Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan, ROC.
- Big Data Center, Taipei Veterans General Hospital, Taipei, Taiwan, ROC.
| | - Yi-Chen Yeh
- Department of Pathology and Laboratory Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC.
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC.
| |
Collapse
|
38
|
Ozdag Y, Mahmoud M, Klena JC, Grandizio LC. Artificial Intelligence in Personal Statements Within Orthopaedic Surgery Residency Applications. J Am Acad Orthop Surg 2025; 33:554-560. [PMID: 40101179 DOI: 10.5435/jaaos-d-24-01285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/03/2024] [Accepted: 01/28/2025] [Indexed: 03/20/2025] Open
Abstract
PURPOSE Artificial intelligence (AI) has been increasingly studied within medical education and clinical practice. At present, it remains uncertain whether AI is being used to write personal statements (PSs) for orthopaedic surgery residency applications. Our purpose was to analyze PSs submitted to our institution and determine the rate of AI utilization within these texts. METHODS Four groups were created for comparison: 100 PSs submitted before the release of ChatGPT (PRE-PS), 100 PSs submitted after the introduction of ChatGPT (POST-PS), 10 AI-generated PSs (AI-PS), and 10 hybrid PSs (H-PS), which contained both human-generated and AI-generated text. For each of the four groups, AI detection software (GPTZero) was used to quantify the percentage of human-generated text, "mixed" text, and AI-generated text. In addition, the detection software provided a level of confidence (highly confident, moderately confident, uncertain) with respect to the "final verdict" of human-generated versus AI-generated text. RESULTS The percentages of human-generated text in the PRE-PS, POST-PS, H-PS, and AI-PS groups were 94%, 93%, 28%, and 0%, respectively. All 200 PSs (100%) submitted to our program had a final verdict of "human" with verdict confidence of >90%. By contrast, all AI-generated statements (H-PS and AI-PS groups) had a final verdict of "AI." Verdict confidence for the AI-PS group was 100%. CONCLUSION Orthopaedic surgery residency applicants do not appear, at present, to be using AI to create the PSs included in their applications. AI detection software (GPTZero) appears able to accurately distinguish human-generated from AI-generated PSs for orthopaedic residency applications. Considering the increasing role and development of AI software, future investigations should explore whether these results change over time. As with orthopaedic journals, guidelines should be established for the use of AI in postgraduate training applications. LEVEL OF EVIDENCE V-Nonclinical.
Collapse
Affiliation(s)
- Yagiz Ozdag
- From the Department of Orthopaedic Surgery, Geisinger Commonwealth School of Medicine, Geisinger Musculoskeletal Institute, Danville, PA
| | | | | | | |
Collapse
|
39
|
Riedel M, Meyer B, Kfuri Rubens R, Riedel C, Amann N, Kiechle M, Riedel F. AI-driven simplification of surgical reports in gynecologic oncology: A potential tool for patient education. Acta Obstet Gynecol Scand 2025. [PMID: 40366215 DOI: 10.1111/aogs.15123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2024] [Revised: 02/17/2025] [Accepted: 03/10/2025] [Indexed: 05/15/2025]
Abstract
INTRODUCTION The emergence of large language models heralds a new chapter in natural language processing, with immense potential for improving medical care and especially medical oncology. One recent and publicly available example is Generative Pre-trained Transformer 4 (GPT-4). Our objective was to evaluate its ability to rephrase original surgical reports into simplified versions that are more comprehensible to patients. Specifically, we aimed to investigate and discuss the potential, limitations, and associated risks of using these simplified reports for patient education and information in gynecologic oncology. MATERIAL AND METHODS We tasked GPT-4 with generating simplified versions of n = 20 original gynecologic surgical reports. Patients were provided with both their original report and the corresponding simplified version generated by GPT-4. Alongside these reports, patients received questionnaires designed to facilitate a comparative assessment of the original and simplified surgical reports. Furthermore, clinical experts evaluated the artificial intelligence (AI)-generated reports with regard to their accuracy and clinical quality. RESULTS The simplified surgical reports generated by GPT-4 significantly improved our patients' understanding, particularly with regard to the surgical procedure, its outcome, and potential risks. However, despite the reports being more accessible and relevant, clinical experts highlighted concerns about their lack of medical precision. CONCLUSIONS Advanced language models like GPT-4 can transform unedited surgical reports to improve clarity about the procedure and its outcomes, and they offer considerable promise for enhancing patient education. However, concerns about medical precision underscore the need for rigorous oversight to safely integrate AI into patient education. Over the medium term, AI-generated, simplified versions of these reports, and of other medical records, could be effortlessly integrated into standard automated postoperative care and digital discharge systems.
Collapse
Affiliation(s)
- Maximilian Riedel
- Department of Gynecology and Obstetrics, TUM University Hospital, Technical University Munich (TU), Munich, Germany
| | - Bastian Meyer
- Department of Gynecology and Obstetrics, TUM University Hospital, Technical University Munich (TU), Munich, Germany
| | - Raphael Kfuri Rubens
- Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
- Department of Medicine III, Hematology and Oncology, TUM University Hospital, Technical University Munich, Munich, Germany
- TUM School of Medicine, Technical University of Munich, Munich, Germany
| | - Caroline Riedel
- Department of General Internal Medicine and Psychosomatics, Heidelberg University Hospital, Heidelberg, Germany
| | - Niklas Amann
- Department of Gynecology and Obstetrics, Friedrich-Alexander-University Erlangen-Nuremberg (FAU), Erlangen, Germany
| | - Marion Kiechle
- Department of Gynecology and Obstetrics, TUM University Hospital, Technical University Munich (TU), Munich, Germany
| | - Fabian Riedel
- Department of Gynecology and Obstetrics, Heidelberg University Hospital, Heidelberg, Germany
| |
Collapse
|
40
|
Ozkan E, Tekin A, Ozkan MC, Cabrera D, Niven A, Dong Y. Global Health care Professionals' Perceptions of Large Language Model Use In Practice: Cross-Sectional Survey Study. JMIR MEDICAL EDUCATION 2025; 11:e58801. [PMID: 40354644 PMCID: PMC12088617 DOI: 10.2196/58801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Revised: 04/11/2025] [Accepted: 04/19/2025] [Indexed: 05/14/2025]
Abstract
Background ChatGPT is a large language model-based chatbot developed by OpenAI. ChatGPT has many potential applications to health care, including enhanced diagnostic accuracy and efficiency, improved treatment planning, and better patient outcomes. However, health care professionals' perceptions of ChatGPT and similar artificial intelligence tools are not well known. Understanding these attitudes is important to inform the best approaches to exploring their use in medicine. Objective Our aim was to evaluate the health care professionals' awareness and perceptions regarding potential applications of ChatGPT in the medical field, including potential benefits and challenges of adoption. Methods We designed a 33-question online survey that was distributed among health care professionals via targeted emails and professional Twitter and LinkedIn accounts. The survey included a range of questions to define respondents' demographic characteristics, familiarity with ChatGPT, perceptions of this tool's usefulness and reliability, and opinions on its potential to improve patient care, research, and education efforts. Results One hundred and fifteen health care professionals from 21 countries responded to the survey, including physicians, nurses, researchers, and educators. Of these, 101 (87.8%) had heard of ChatGPT, mainly from peers, social media, and news, and 77 (76.2%) had used ChatGPT at least once. Participants found ChatGPT to be helpful for writing manuscripts (n=31, 45.6%), emails (n=25, 36.8%), and grants (n=12, 17.6%); accessing the latest research and evidence-based guidelines (n=21, 30.9%); providing suggestions on diagnosis or treatment (n=15, 22.1%); and improving patient communication (n=12, 17.6%). Respondents also felt that the ability of ChatGPT to access and summarize research articles (n=22, 46.8%), provide quick answers to clinical questions (n=15, 31.9%), and generate patient education materials (n=10, 21.3%) was helpful. However, there are concerns regarding the use of ChatGPT, for example, the accuracy of responses (n=14, 29.8%), limited applicability in specific practices (n=18, 38.3%), and legal and ethical considerations (n=6, 12.8%), mainly related to plagiarism or copyright violations. Participants stated that safety protocols such as data encryption (n=63, 62.4%) and access control (n=52, 51.5%) could assist in ensuring patient privacy and data security. Conclusions Our findings show that ChatGPT use is widespread among health care professionals in daily clinical, research, and educational activities. The majority of our participants found ChatGPT to be useful; however, there are concerns about patient privacy, data security, and its legal and ethical issues as well as the accuracy of its information. Further studies are required to understand the impact of ChatGPT and other large language models on clinical, educational, and research outcomes, and the concerns regarding its use must be addressed systematically and through appropriate methods.
Collapse
Affiliation(s)
- Ecem Ozkan
- Department of Medicine, Jersey Shore University Medical Center, 1945 NJ-33, Neptune, NJ, 07753, United States, 1 5078843064
| | - Aysun Tekin
- Department of Anesthesiology, Mayo Clinic College of Medicine, Rochester, MN, United States
| | - Mahmut Can Ozkan
- Department of Medicine, Jersey Shore University Medical Center, 1945 NJ-33, Neptune, NJ, 07753, United States, 1 5078843064
| | - Daniel Cabrera
- Department of Emergency Medicine, Mayo Clinic College of Medicine, Rochester, MN, United States
| | - Alexander Niven
- Department of Pulmonary and Critical Care Medicine, Mayo Clinic College of Medicine, Rochester, MN, United States
| | - Yue Dong
- Department of Anesthesiology, Mayo Clinic College of Medicine, Rochester, MN, United States
| |
Collapse
|
41
|
Sanli AN, Tekcan Sanli DE, Karabulut A. Can American Board of Surgery in Training Examinations be passed by Large Language Models? Comparative assessment of Gemini, Copilot, and ChatGPT. Am Surg 2025:31348251341956. [PMID: 40353502 DOI: 10.1177/00031348251341956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/14/2025]
Abstract
OBJECTIVE This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE). METHODS Multiple-choice questions from the ABSITE Quiz were entered as prompts into the most popular LLMs. ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google) were used in the study. The research comprised 170 questions from 2017 to 2022, which were divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were queried in the LLMs between October 1, 2024, and October 5, 2024. The correct answer rates of the LLMs were evaluated. RESULTS The correct response rates for all questions were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower (P = 0.005 and P = 0.015, respectively). In the Biochemistry/Pharmaceutical category, the correct response rates were equal in all three groups (83.3%). In the Case Scenario category, the correct response rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini. Although Gemini had the lowest accuracy, the difference was not statistically significant (P = 0.236). CONCLUSION In the ABSITE Quiz, ChatGPT and Copilot had similar success, whereas Gemini lagged significantly behind.
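As a back-of-the-envelope companion to the rates reported above, the sketch below compares two overall correct-answer proportions with a two-proportion z-test; the counts are reconstructed from the reported percentages over 170 questions, and the study's own statistical test may well have differed (a paired design would more naturally use McNemar's test).

```python
# Approximate comparison of overall correct-answer rates (ChatGPT vs. Gemini)
# using a two-proportion z-test; counts are reconstructed from the reported
# percentages and 170 questions, so they are estimates only.
from statsmodels.stats.proportion import proportions_ztest

n_questions = 170
correct = [round(0.794 * n_questions), round(0.529 * n_questions)]  # ChatGPT, Gemini
stat, p = proportions_ztest(count=correct, nobs=[n_questions, n_questions])
print(f"z = {stat:.2f}, p = {p:.4f}")
```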
Collapse
Affiliation(s)
- Ahmet Necati Sanli
- Department of General Surgery, Abdulkadir Yuksel State Hospital, Gaziantep, Turkey
| | | | - Ali Karabulut
- Department of General Surgery, Bagcilar Training and Research Hospital, University of Health Sciences, Istanbul, Turkey
| |
Collapse
|
42
|
Luo Y, Jiao M, Fotedar N, Ding JE, Karakis I, Rao VR, Asmar M, Xian X, Aboud O, Wen Y, Lin JJ, Hung FM, Sun H, Rosenow F, Liu F. Clinical Value of ChatGPT for Epilepsy Presurgical Decision-Making: Systematic Evaluation of Seizure Semiology Interpretation. J Med Internet Res 2025; 27:e69173. [PMID: 40354107 PMCID: PMC12107199 DOI: 10.2196/69173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2024] [Revised: 02/03/2025] [Accepted: 03/10/2025] [Indexed: 05/14/2025] Open
Abstract
BACKGROUND For patients with drug-resistant focal epilepsy, surgical resection of the epileptogenic zone (EZ) is an effective treatment to control seizures. Accurate localization of the EZ is crucial and is typically achieved through comprehensive presurgical approaches such as seizure semiology interpretation, electroencephalography (EEG), magnetic resonance imaging (MRI), and intracranial EEG (iEEG). However, interpreting seizure semiology is challenging because it heavily relies on expert knowledge. The semiologies are often inconsistent and incoherent, leading to variability and potential limitations in presurgical evaluation. To overcome these challenges, advanced technologies like large language models (LLMs)-with ChatGPT being a notable example-offer valuable tools for analyzing complex textual information, making them well-suited to interpret detailed seizure semiology descriptions and accurately localize the EZ. OBJECTIVE This study evaluates the clinical value of ChatGPT for interpreting seizure semiology to localize EZs in presurgical assessments for patients with focal epilepsy and compares its performance with that of epileptologists. METHODS We compiled 2 data cohorts: a publicly sourced cohort of 852 semiology-EZ pairs from 193 peer-reviewed journal publications and a private cohort of 184 semiology-EZ pairs collected from Far Eastern Memorial Hospital (FEMH) in Taiwan. ChatGPT was evaluated to predict the most likely EZ locations using 2 prompt methods: zero-shot prompting (ZSP) and few-shot prompting (FSP). To compare the performance of ChatGPT, 8 epileptologists were recruited to participate in an online survey to interpret 100 randomly selected semiology records. The responses from ChatGPT and epileptologists were compared using 3 metrics: regional sensitivity (RSens), weighted sensitivity (WSens), and net positive inference rate (NPIR). RESULTS In the publicly sourced cohort, ChatGPT demonstrated high RSens reliability, achieving 80% to 90% for the frontal and temporal lobes; 20% to 40% for the parietal lobe, occipital lobe, and insular cortex; and only 3% for the cingulate cortex. The WSens, which accounts for biased data distribution, consistently exceeded 67%, while the mean NPIR remained around 0. These evaluation results based on the private FEMH cohort are consistent with those from the publicly sourced cohort. A group t test with 1000 bootstrap samples revealed that ChatGPT-4 significantly outperformed epileptologists in RSens for the most frequently implicated EZs, such as the frontal and temporal lobes (P<.001). Additionally, ChatGPT-4 demonstrated superior overall performance in WSens (P<.001). However, no significant differences were observed between ChatGPT and the epileptologists in NPIR, highlighting comparable performance in this metric. CONCLUSIONS ChatGPT demonstrated clinical value as a tool to assist decision-making during epilepsy preoperative workups. With ongoing advancements in LLMs, their reliability and accuracy are anticipated to improve.
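To make the two prompting strategies named above concrete, the sketch below builds a zero-shot and a few-shot prompt for semiology interpretation; the semiology text and the worked example are invented placeholders rather than cases from either cohort.

```python
# Illustrative sketch of zero-shot vs. few-shot prompt construction for
# semiology-to-EZ interpretation. The semiology text and the example pair are
# invented placeholders, not cases from the publicly sourced or FEMH cohorts.
SEMIOLOGY = "Seizure begins with an epigastric aura followed by oral automatisms."

zero_shot = (
    "You are an epileptologist. Based on the seizure semiology below, "
    "list the most likely epileptogenic zone(s).\n\n"
    f"Semiology: {SEMIOLOGY}"
)

few_shot = (
    "You are an epileptologist. Based on seizure semiology, list the most "
    "likely epileptogenic zone(s).\n\n"
    "Example\n"
    "Semiology: Versive head turning with tonic posturing of the right arm.\n"
    "Likely EZ: Left frontal lobe.\n\n"
    f"Semiology: {SEMIOLOGY}\n"
    "Likely EZ:"
)

print(zero_shot)
print("---")
print(few_shot)
```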
Collapse
Affiliation(s)
- Yaxi Luo
- Department of Computer Science, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Meng Jiao
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Neel Fotedar
- School of Medicine, Case Western Reserve University, Cleveland, OH, United States
- Department of Neurology, University Hospitals Cleveland Medical Center, Cleveland, OH, United States
| | - Jun-En Ding
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
| | - Ioannis Karakis
- Department of Neurology, School of Medicine, Emory University, Atlanta, GA, United States
- Department of Neurology, School of Medicine, University of Crete, Heraklion, Greece
| | - Vikram R Rao
- Department of Neurology and Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
| | - Melissa Asmar
- Department of Neurology, University of California, Davis, Davis, CA, United States
| | - Xiaochen Xian
- H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Orwa Aboud
- Department of Neurology and Neurological Surgery, University of California, Davis, Davis, CA, United States
| | - Yuxin Wen
- Fowler School of Engineering, Chapman University, Orange, CA, United States
| | - Jack J Lin
- Department of Neurology, University of California, Davis, Davis, CA, United States
| | - Fang-Ming Hung
- Center of Artificial Intelligence, Far Eastern Memorial Hospital, New Taipei City, Taiwan
- Surgical Trauma Intensive Care Unit, Far Eastern Memorial Hospital, New Taipei City, Taiwan
| | - Hai Sun
- Department of Neurosurgery, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, United States
| | - Felix Rosenow
- Department of Neurology, Epilepsy Center Frankfurt Rhine-Main, Goethe University Frankfurt, Frankfurt am Main, Germany
| | - Feng Liu
- Department of Systems and Enterprises, Schaefer School of Engineering & Science, Stevens Institute of Technology, Hoboken, NJ, United States
- Semcer Center for Healthcare Innovation, Stevens Institute of Technology, Hoboken, NJ, United States
| |
Collapse
|
43
|
An R. Artificial intelligence in health and sport sciences: Promise, progress, and prudence. JOURNAL OF SPORT AND HEALTH SCIENCE 2025:101054. [PMID: 40349842 DOI: 10.1016/j.jshs.2025.101054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/28/2025] [Accepted: 04/28/2025] [Indexed: 05/14/2025]
Affiliation(s)
- Ruopeng An
- Silver School of Social Work, New York University, New York, NY, 10012, USA.
| |
Collapse
|
44
|
Digiacomo A, Orsini A, Cicchetti R, Spadano L, De Santis S, Di Sessa L, Vitale M, Di Nicola M, Tamborino F, Basconi M, De Archangelis R, Salzano G, Dello Stritto G, Lannutti P, Schips L, Marchioni M. ChatGPT vs traditional pedagogy: a comparative study in urological learning. World J Urol 2025; 43:286. [PMID: 40338279 PMCID: PMC12062111 DOI: 10.1007/s00345-025-05654-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2024] [Accepted: 04/22/2025] [Indexed: 05/09/2025] Open
Abstract
PURPOSE Technological evolution is radically changing medical learning models. We evaluated the learning outcomes of urological concepts taught using ChatGPT, a traditional lecture, or a combined approach. METHODS We conducted a randomized triple-blind study of 121 medical students with no previous formal curriculum in urology. Students were randomly divided into three study classes with different learning methods: ChatGPT, Lecture, and ChatGPT + Lecture. The adrenal glands were randomly selected as the subject of the lessons. Students were evaluated using a thirty-question test. RESULTS The median evaluation test score was higher for students who underwent ChatGPT + Lecture than for those who had only ChatGPT (12 vs. 10, p = 0.007). This difference remained statistically significant in multivariable models adjusting for year of course, gender, and previous ChatGPT experience (estimate: 2.6, p = 0.002). For most of the questions (about 70%), the proportion of students answering correctly was higher in the ChatGPT + Lecture group than in the other groups. CONCLUSION ChatGPT loses its potential if used without a prior knowledge base. The limits of its scientific reliability persist, and a teacher-guided method remains essential. ChatGPT combined with a traditional lecture gives more effective results than the traditional lecture alone, while also allowing better use of the chatbot.
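A rough sketch of the adjusted analysis described above follows: an ordinary least squares model of test score on learning group, controlling for year of course, gender, and prior ChatGPT experience. The dataframe, column names, and simulated scores are hypothetical, and the study's exact model specification may differ.

```python
# Hedged sketch of a multivariable comparison of learning groups, adjusting for
# year of course, gender, and prior ChatGPT experience. All data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "group": rng.choice(["ChatGPT", "Lecture", "ChatGPT_Lecture"], size=n),
    "year": rng.integers(2, 7, size=n),          # year of course
    "gender": rng.choice(["F", "M"], size=n),
    "prior_gpt": rng.integers(0, 2, size=n),     # previous ChatGPT experience
})
# Simulated scores with a small bump for the combined approach
df["score"] = 10 + 2 * (df["group"] == "ChatGPT_Lecture") + rng.normal(0, 2, size=n)

model = smf.ols("score ~ C(group, Treatment('ChatGPT')) + year + gender + prior_gpt",
                data=df).fit()
print(model.params)   # group coefficients are adjusted differences vs. ChatGPT alone
```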
Collapse
Affiliation(s)
- Alessio Digiacomo
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy.
| | - Angelo Orsini
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Rossella Cicchetti
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | | | - Sara De Santis
- Italian Secretariat for Medical Students (SISM), Chieti, Italy
| | - Laura Di Sessa
- Italian Secretariat for Medical Students (SISM), Chieti, Italy
| | - Miriana Vitale
- Italian Secretariat for Medical Students (SISM), Chieti, Italy
| | - Marta Di Nicola
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, Chieti, Italy
| | - Flavia Tamborino
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Martina Basconi
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Riccardo De Archangelis
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Gaetano Salzano
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Guglielmo Dello Stritto
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Peppino Lannutti
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Luigi Schips
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| | - Michele Marchioni
- Department of Medical, Oral and Biotechnological Sciences, "G. d'Annunzio" University of Chieti-Pescara, SS Annunziata Hospital, Urology Unit, Chieti, Italy
| |
Collapse
|
45
|
Yitzhaki S, Peled N, Kaplan E, Kadmon G, Nahum E, Gendler Y, Weissbach A. Comparing ChatGPT-4 and a Paediatric Intensive Care Specialist in Responding to Medical Education Questions: A Multicenter Evaluation. J Paediatr Child Health 2025. [PMID: 40331496 DOI: 10.1111/jpc.70080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/31/2025] [Revised: 03/19/2025] [Accepted: 04/26/2025] [Indexed: 05/08/2025]
Abstract
OBJECTIVE To compare the performance of the Generative Pre-trained Transformer model 4 (ChatGPT-4) with that of a paediatric intensive care unit (PICU) specialist in responding to open-ended medical education questions. METHODS A comparative analysis was conducted using 100 educational questions sourced from a PICU trainee WhatsApp forum, covering factual knowledge and clinical reasoning. Ten PICU specialists from multiple tertiary paediatric centres independently evaluated 20 sets of paired responses from ChatGPT-4 and a PICU specialist (the original respondent to the forum questions), assessing overall superiority, completeness, accuracy, and integration potential. RESULTS After excluding one question requiring a visual aid, 198 paired evaluations were made (96 factual knowledge and 102 clinical reasoning). ChatGPT-4's responses were significantly longer than those of the PICU specialist (median words: 189 vs. 41; p < 0.0001). ChatGPT-4 was preferred in 60% of factual knowledge comparisons (p < 0.001), while the PICU specialist's responses were preferred in 67% of clinical reasoning comparisons (p < 0.0001). ChatGPT-4 demonstrated superior completeness in factual knowledge (p = 0.02) but lower accuracy in clinical reasoning (p < 0.0001). Integration of both answers was favoured in 37% of cases (95% CI, 31%-44%). CONCLUSIONS ChatGPT-4 shows promise as a tool for factual medical education in the PICU, excelling in completeness. However, it requires oversight in clinical reasoning tasks, where the PICU specialist's responses remain superior. Expert review is essential before using ChatGPT-4 independently in PICU education and in other similarly underexplored medical fields.
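As a small worked companion to the interval reported in the last sentence, the sketch below computes a 95% Wilson confidence interval for a preference proportion; the count of 73 out of 198 is inferred from the reported 37%, and the paper may have used a different interval method, so treat it as an approximation.

```python
# Minimal sketch: 95% Wilson confidence interval for a preference proportion.
# The count 73/198 is reconstructed from the reported 37%, not an exact figure.
from statsmodels.stats.proportion import proportion_confint

favoured, total = 73, 198          # ~37% of paired evaluations
low, high = proportion_confint(favoured, total, alpha=0.05, method="wilson")
print(f"{favoured / total:.0%} (95% CI {low:.0%}-{high:.0%})")
```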
Collapse
Affiliation(s)
- Shai Yitzhaki
- Department of Pediatrics A, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Nadav Peled
- Adelson School of Medicine, Ariel University, Ariel, Israel
| | - Eytan Kaplan
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
| | - Gili Kadmon
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
| | - Elhanan Nahum
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
| | - Yulia Gendler
- Department of Nursing at the School of Health Sciences, Ariel University, Ariel, Israel
| | - Avichai Weissbach
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
| |
Collapse
|
46
|
Ostrovsky AM. Language models in radiology: Early promise, transparent limitations, and the path ahead. Am J Emerg Med 2025:S0735-6757(25)00315-8. [PMID: 40345907 DOI: 10.1016/j.ajem.2025.05.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2025] [Accepted: 05/03/2025] [Indexed: 05/11/2025] Open
Affiliation(s)
- Adam M Ostrovsky
- Sidney Kimmel Medical College at Thomas Jefferson University, 1025 Walnut St., Philadelphia, PA 19107, USA.
| |
Collapse
|
47
|
Erdemir AG. Reader Comment Regarding "Evaluating a large language model's accuracy in chest X-ray interpretation for acute thoracic conditions". Am J Emerg Med 2025:S0735-6757(25)00316-X. [PMID: 40348657 DOI: 10.1016/j.ajem.2025.05.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2025] [Revised: 04/11/2025] [Accepted: 05/03/2025] [Indexed: 05/14/2025] Open
Affiliation(s)
- Ahmet Gürkan Erdemir
- Department of Radiology, Faculty of Medicine, Hacettepe University, Ankara, Türkiye.
| |
Collapse
|
48
|
Buhl LK. The answer may vary: large language model response patterns challenge their use in test item analysis. MEDICAL TEACHER 2025:1-6. [PMID: 40319392 DOI: 10.1080/0142159x.2025.2497891] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/30/2025] [Accepted: 04/22/2025] [Indexed: 05/07/2025]
Abstract
INTRODUCTION The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. METHODS Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. RESULTS Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28-0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. DISCUSSION These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM's response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
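For readers unfamiliar with the two item statistics named in this abstract, the sketch below computes a difficulty index (proportion answering correctly) and a point biserial index (item-total score correlation) on a hypothetical 0/1 response matrix; the data are simulated, not the study's responses.

```python
# Minimal sketch of classical item statistics on a simulated response matrix
# (rows = examinees, columns = items; 1 = correct, 0 = incorrect).
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(30, 10))   # 30 examinees, 10 items
total_scores = responses.sum(axis=1)

for item in range(responses.shape[1]):
    item_scores = responses[:, item]
    difficulty = item_scores.mean()              # proportion of correct responses
    # Item-test score correlation; a corrected variant would exclude the item
    # from the total score before correlating.
    r_pb, _ = pointbiserialr(item_scores, total_scores)
    print(f"item {item + 1:2d}: difficulty = {difficulty:.2f}, point biserial = {r_pb:.2f}")
```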
Collapse
Affiliation(s)
- Lauren K Buhl
- Department of Anesthesiology, Dartmouth Hitchcock Medical Center, Lebanon, NH, USA
| |
Collapse
|
49
|
Bitterman J, D'Angelo A, Holachek A, Eubanks JE. Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions. PM R 2025. [PMID: 40318209 DOI: 10.1002/pmrj.13386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Revised: 01/16/2025] [Accepted: 02/17/2025] [Indexed: 05/07/2025]
Abstract
BACKGROUND There have been significant advances in machine learning and artificial intelligence technology over the past few years, leading to the release of large language models (LLMs) such as ChatGPT. There are many potential applications for LLMs in health care, but it is critical to first determine how accurate LLMs are before putting them into practice. No studies have evaluated the accuracy and precision of LLMs in responding to questions related to the field of physical medicine and rehabilitation (PM&R). OBJECTIVE To determine the accuracy and precision of two OpenAI LLMs (GPT-3.5, released in November 2022, and GPT-4o, released in May 2024) in answering questions related to PM&R knowledge. DESIGN Cross-sectional study. Both LLMs were tested on the same 744 PM&R knowledge questions that covered all aspects of the field (general rehabilitation, stroke, traumatic brain injury, spinal cord injury, musculoskeletal medicine, pain medicine, electrodiagnostic medicine, pediatric rehabilitation, prosthetics and orthotics, rheumatology, and pharmacology). Each LLM was tested three times on the same question set to assess for precision. SETTING N/A. PATIENTS N/A. INTERVENTIONS N/A. MAIN OUTCOME MEASURE Percentage of correctly answered questions. RESULTS For three runs of the 744-question set, GPT-3.5 answered 56.3%, 56.5%, and 56.9% of the questions correctly. For three runs of the same question set, GPT-4o answered 83.6%, 84%, and 84.1% of the questions correctly. GPT-4o outperformed GPT-3.5 in all subcategories of PM&R questions. CONCLUSIONS LLM technology is rapidly advancing, with the more recent GPT-4o model performing much better on PM&R knowledge questions compared to GPT-3.5. There is potential for LLMs in augmenting clinical practice, medical training, and patient education. However, the technology has limitations and physicians should remain cautious in using it in practice at this time.
Collapse
Affiliation(s)
- Jason Bitterman
- Division of Physical Medicine and Rehabilitation, Hartford Healthcare Medical Group, Hartford, Connecticut, USA
| | - Alexander D'Angelo
- Nebraska Medicine Department of Physical Medicine and Rehabilitation, University of Nebraska Medical Center, Omaha, Nebraska, USA
| | | | - James E Eubanks
- Department of Orthopedics and Physical Medicine, Division of Physical Medicine and Rehabilitation, Medical University of South Carolina (MUSC), Charleston, South Carolina, USA
- Department of Physical Medicine and Rehabilitation, University of Pittsburgh Medical Center (UPMC), Pittsburgh, Pennsylvania, USA
| |
Collapse
|
50
|
Kim H, Hwang H, Lee J, Park S, Kim D, Lee T, Yoon C, Sohn J, Park J, Reykhart O, Fetherston T, Choi D, Kwak SH, Chen Q, Kang J. Small language models learn enhanced reasoning skills from medical textbooks. NPJ Digit Med 2025; 8:240. [PMID: 40316765 PMCID: PMC12048634 DOI: 10.1038/s41746-025-01653-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2025] [Accepted: 04/19/2025] [Indexed: 05/04/2025] Open
Abstract
Small language models (SLM) offer promise for medical applications by addressing the privacy and hardware constraints of large language models; however, their limited parameters (often fewer than ten billion) hinder multi-step reasoning for complex medical tasks. This study presents Meerkat, a new family of medical SLMs designed to be lightweight while enhancing reasoning capabilities. We begin by designing an effective and efficient training method. This involves extracting high-quality chain-of-thought reasoning paths from 18 medical textbooks, which are then combined with diverse instruction-following datasets within the medical domain, totaling 441K training examples. Fine-tuning was conducted on open-source SLMs using this curated dataset. Our Meerkat-7B and Meerkat-8B models outperformed their counterparts by 22.3% and 10.6% across six exam datasets, respectively. They also improved scores on the NEJM Case Challenge from 7 to 16 and from 13 to 20, surpassing the human score of 13.7. Additionally, they demonstrated superiority in expert evaluations, excelling in all assessed dimensions of reasoning ability: completeness, factuality, clarity, and logical consistency.
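The training recipe summarized above (textbook-derived chain-of-thought paths combined with instruction-following data) implies a data-packaging step along the lines of the sketch below; the example content, field names, and file name are invented and do not reflect the authors' released data format.

```python
# Illustrative sketch of packaging question / chain-of-thought / answer triples
# into a JSONL instruction-tuning file. The example content is invented.
import json

examples = [
    {
        "question": "A 54-year-old presents with crushing chest pain and ST elevation. First step?",
        "chain_of_thought": "ST elevation suggests STEMI; immediate reperfusion is the priority.",
        "answer": "Activate the cath lab for primary PCI.",
    },
]

with open("cot_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        prompt = f"Question: {ex['question']}\nLet's think step by step."
        completion = f"{ex['chain_of_thought']}\nFinal answer: {ex['answer']}"
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```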
Collapse
Grants
- HR20C0021(3) Ministry of Health & Welfare, Republic of Korea
- IITP-2024-2020-0-0181 Ministry of Science and ICT, South Korea
- NRF-2023R1A2C3004176 National Research Foundation of Korea
Collapse
Affiliation(s)
| | | | - Jiwoo Lee
- Korea University, Seoul, Republic of Korea
| | | | - Dain Kim
- Korea University, Seoul, Republic of Korea
| | | | | | | | | | | | | | | | - Soo Heon Kwak
- Seoul National University Hospital, Seoul, Republic of Korea
| | | | - Jaewoo Kang
- Korea University, Seoul, Republic of Korea.
- AIGEN Sciences, Seoul, Republic of Korea.
| |
Collapse
|