Revised: June 13, 2025
Accepted: July 31, 2025
Published online: August 26, 2025
Processing time: 74 Days and 22.4 Hours
The integration of sophisticated large language models (LLMs) into healthcare has recently garnered significant attention due to their ability to leverage deep learning techniques to process vast datasets and generate contextually accurate, human-like responses. These models have previously been applied in medical diagnostics, such as in the evaluation of oral lesions. Given the high rate of missed diagnoses in pericarditis, LLMs may support clinicians in generating differential diagnoses, particularly in atypical cases where risk stratification and early identification are critical to preventing serious complications such as constrictive pericarditis and pericardial tamponade.
To compare the accuracy of LLMs as risk stratification tools in assisting the diagnosis of pericarditis.
A PubMed search was conducted using the keyword “pericarditis”, applying filters for “case reports”. Data from relevant cases were extracted. Inclusion criteria consisted of English-language reports involving patients aged 18 years or older with a confirmed diagnosis of acute pericarditis. The diagnostic capabilities of ChatGPT o1 and DeepSeek-R1 were assessed by evaluating whether pericarditis was included in the top three differential diagnoses and whether it was identified as the sole provisional diagnosis. Each case was classified as either “yes” or “no” for inclusion in each category.
From the initial search, 220 studies were identified, of which 16 case reports met the inclusion criteria. In assessing risk stratification for acute pericarditis, ChatGPT o1 correctly identified the condition in 10 of 16 cases (62.5%) in the differential diagnosis and in 8 of 16 cases (50.0%) as the provisional diagnosis. DeepSeek-R1 identified it in 8 of 16 cases (50.0%) and 6 of 16 cases (37.5%), respectively. ChatGPT o1 demonstrated numerically higher accuracy than DeepSeek-R1 in identifying pericarditis, although the difference was not statistically significant.
Further research with larger sample sizes and optimized prompt engineering is warranted to improve diagnostic accuracy, particularly in atypical presentations.
Core Tip: This study evaluates the capabilities of large language models (LLMs), ChatGPT o1 and DeepSeek-R1, in the risk stratification of acute pericarditis, where delayed diagnosis may lead to significant complications. While both LLMs show similar performance and promise as supportive tools in identifying high-risk presentations, their current limitations in recognizing atypical symptom profiles underscore the need for further refinement. Future research should focus on improving model sensitivity to demographic and clinical variability to ensure broader applicability and safety in real-world settings.
- Citation: Goyal A, Sulaiman SA, Alaarag A, Hoshan W, Goyal P, Shah V, Daoud M, Mahalwar G, Sheikh AB. Comparison of ChatGPT and DeepSeek large language models in the diagnosis of pericarditis. World J Cardiol 2025; 17(8): 110489
- URL: https://www.wjgnet.com/1949-8462/full/v17/i8/110489.htm
- DOI: https://dx.doi.org/10.4330/wjc.v17.i8.110489
The utilization of sophisticated large language models (LLMs) has recently garnered significant attention, given their ability to harness deep learning techniques to process vast datasets and generate contextually accurate, human-like responses[1]. While the use of ChatGPT and DeepSeek is well established in areas such as data management, their integration into healthcare has been more restrained due to challenges in accommodating cultural and contextual factors that may influence clinical practice and outcomes, as well as inaccuracies when dealing with rare pathologies or atypical patient presentations[2]. Atypical presentations of pericarditis include patients in whom the classic features (chest pain, pericardial friction rub, and electrocardiogram [ECG] changes) are absent or altered, such as patients with ST segment abnormalities in distributions other than the typical deviations, or with no ST segment deviations at all[3]. Previous studies have explored the use of LLMs in medical diagnostics in fields such as radiology and dermatology, as well as in the diagnosis of oral lesions[1]; however, their application varies from one field to another, reflecting differences in patient presentation and clinical guidelines.
Given the high rate of missed diagnoses in pericarditis, as highlighted by Liu et al[2], LLMs may assist physicians in generating differential diagnoses, particularly in atypical cases, where risk stratification and early diagnosis are critical in mitigating serious complications such as constrictive pericarditis and pericardial tamponade[2]. Their accessibility, affordability, and widespread availability on mobile and desktop platforms further position them as promising tools for improving diagnostic accuracy[1].
Therefore, as ChatGPT o1 and DeepSeek-R1, two advanced LLMs, show promising potential in assisting physicians with diagnosing diseases that have high rates of missed or incorrect diagnoses, this study aims to compare their accuracy as risk stratification tools in assisting the diagnosis of pericarditis.
A comprehensive PubMed search using the keyword “pericarditis” was conducted, applying filters for “case reports” and restricting the publication date to 2024-2025 to ensure novelty. Data from relevant cases were extracted, including key symptoms, vital signs, physical examination findings, ECG results, and final diagnoses, and were organized in Microsoft Excel. The diagnosis of acute pericarditis was validated for each case based on the data provided. Inclusion criteria consisted of English-language reports of patients aged 18 years or older with a confirmed diagnosis of acute pericarditis. Only case reports on acute pericarditis were included, to prevent the heterogeneity of presentation in other types of pericarditis from confounding our findings.
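For illustration, the search described above could also be run programmatically. The following Python sketch uses Biopython's Entrez utilities; the query string, contact e-mail, and retmax value are assumptions for reproducibility, not the exact search performed by the authors, who screened records manually.

# Hypothetical sketch of the PubMed search described above (not the authors' actual code).
from Bio import Entrez  # Biopython's interface to NCBI E-utilities

Entrez.email = "reviewer@example.com"  # placeholder; NCBI requires a contact e-mail

# Assumed query combining the keyword, the "case reports" filter, and the 2024-2025 date window.
query = ('pericarditis AND case reports[Publication Type] AND '
         '("2024/01/01"[Date - Publication] : "2025/12/31"[Date - Publication])')

handle = Entrez.esearch(db="pubmed", term=query, retmax=300)
record = Entrez.read(handle)
handle.close()

print(f"Records found: {record['Count']}")  # the manual search identified 220 studies
print(record["IdList"][:5])                 # first few PubMed IDs for screening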
All prompts followed a standardized format beginning with the sentence: “We will provide the demographics and case presentation, including history and physical examination, of a patient presenting to a hospital. Provide us with your top three differential diagnoses and a single provisional diagnosis based on this information”, followed by the case data presented in a consistent sequence. The risk stratification performance of ChatGPT o1 and DeepSeek-R1 was evaluated based on whether pericarditis was included in the top three differential diagnoses and/or identified as the single provisional diagnosis. Each case was labeled “yes” or “no” for pericarditis inclusion in each category. Fisher’s exact test was performed separately to assess the statistical significance of differences between the models in both differential diagnosis and provisional diagnosis classifications.
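As a minimal illustration of this protocol, the Python sketch below assembles the standardized prompt, labels model responses, and applies Fisher’s exact test to the per-model counts. The labeling helper and the 2 x 2 table layout are assumptions for illustration, not the authors’ actual implementation; in the study, “yes”/“no” labels were assigned by the reviewers.

# Hypothetical sketch of the prompting and evaluation workflow (not the authors' actual code).
from scipy.stats import fisher_exact

PROMPT_HEADER = (
    "We will provide the demographics and case presentation, including history and "
    "physical examination, of a patient presenting to a hospital. Provide us with your "
    "top three differential diagnoses and a single provisional diagnosis based on this information"
)

def build_prompt(case_text: str) -> str:
    # Prepend the standardized instruction to the case data, which always follows the same sequence.
    return f"{PROMPT_HEADER}\n\n{case_text}"

def includes_pericarditis(response_text: str) -> bool:
    # Simplified "yes"/"no" label; in the study this judgment was made by the reviewers.
    return "pericarditis" in response_text.lower()

# Assumed 2 x 2 layout for Fisher's exact test: rows = model, columns = correct / incorrect.
# Counts mirror the reported differential-diagnosis results (10/16 vs 8/16).
table = [[10, 6],   # ChatGPT o1
         [8, 8]]    # DeepSeek-R1
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact test: P = {p_value:.2f}")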
From the search, 220 studies were identified during preliminary screening, with 16 case reports meeting our selection criteria. In assessing risk stratification for acute pericarditis, ChatGPT o1 correctly identified the condition in 10 of 16 cases (62.5%) in the differential diagnosis and 8 of 16 cases (50.0%) in the provisional diagnosis, while DeepSeek-R1 identified it in 8 of 16 cases (50.0%) and 6 of 16 cases (37.5%), respectively, as shown in Supplementary Table 1.
McNemar’s test was applied to compare the models’ classification accuracy for risk stratification, showing no significant differences: the χ² value for ChatGPT o1 vs DeepSeek-R1 was 0.5 (P = 0.48) for the differential diagnosis and 0.25 (P = 0.62) for the provisional diagnosis. Cohen’s kappa was used to assess inter-rater agreement in risk stratification, revealing substantial agreement for the differential diagnosis (kappa = 0.75, P = 0.002) and moderate agreement for the provisional diagnosis (kappa = 0.5, P = 0.04).
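These paired statistics can be reproduced from per-case “yes”/“no” labels. The sketch below is a hypothetical reconstruction in which the case ordering is assumed, chosen only to be consistent with the reported marginal counts; it is not the study’s actual data.

# Hypothetical reconstruction of the paired comparison (differential-diagnosis labels).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from sklearn.metrics import cohen_kappa_score

# 1 = pericarditis included, 0 = not included; one entry per case report (n = 16).
# The ordering below is an assumption consistent with the reported totals (10/16 vs 8/16).
chatgpt_labels = np.array([1] * 10 + [0] * 6)
deepseek_labels = np.array([1] * 8 + [0] * 8)

# Paired 2 x 2 table: rows = ChatGPT o1 (yes/no), columns = DeepSeek-R1 (yes/no).
paired = np.zeros((2, 2), dtype=int)
for a, b in zip(chatgpt_labels, deepseek_labels):
    paired[1 - a, 1 - b] += 1

result = mcnemar(paired, exact=False, correction=True)      # chi-square form of McNemar's test
kappa = cohen_kappa_score(chatgpt_labels, deepseek_labels)  # inter-model agreement
print(f"McNemar chi2 = {result.statistic:.2f}, P = {result.pvalue:.2f}, kappa = {kappa:.2f}")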
This study’s findings suggest that while ChatGPT o1 accurately identified a larger number of cases, both models performed similarly in identifying acute pericarditis within a risk stratification framework, with no significant differences in their classification accuracy.
These findings align with previous work showing that DeepSeek-R1 matches ChatGPT o1’s abilities in data-driven scientific tasks[4]. In contrast, Chowdhury et al[5] demonstrated that DeepSeek outperforms ChatGPT in adaptability and domain-specific tasks, whereas ChatGPT excels through reinforcement learning from human feedback, generating refined outputs based on human interaction rather than prioritizing scalability. Both models’ extensive training datasets, derived from diverse information sources, constitute a significant strength. However, they also carry the risk of incorporating conflicting, inconsistent, or erroneous data, especially in simpler tasks or niche domains, where content moderation becomes essential to prevent the misuse of LLMs for generating misleading or harmful information[4]. ChatGPT o1 incorporates content moderation as part of its reinforcement training, whereas this remains undocumented for DeepSeek-R1; such moderation is necessary to mitigate the ethical concerns that arise from integrating LLMs into practice[4]. Nevertheless, the findings of our study show the potential of both models, particularly in resource-limited environments with restricted access to imaging or specialists. For instance, in rural clinics or other low-resource settings where specialists may be unavailable, ChatGPT o1 and DeepSeek-R1 could support early decision-making and reduce diagnostic delays.
Despite their potential utility in resource-limited settings where access to imaging and specialists is restricted, LLMs remain limited by inconsistencies, misinformation, and conflicting outputs. Furthermore, neither model can fully replicate the nuanced clinical judgment of a physician[4]. Unlike AI, clinicians integrate multiple data sources, including non-verbal cues, evolving patient presentations, and sociocultural contexts, factors that AI may misinterpret or overlook[5]. Thus, LLMs should be considered adjunct tools rather than replacements for medical expertise, and could be integrated into electronic medical records to assist in identifying missed diagnoses or suggesting additional diagnostic tests.
Of note, typical diagnostic features of pericarditis, such as pleuritic chest pain, pericardial friction rub, and characteristic ECG changes, may be subtle or even absent: chest pain is reported in more than 85%-90% of patients, whereas characteristic ECG changes occur in no more than 60%, complicating the diagnosis[6]. Misdiagnosis of pericarditis as myocardial infarction, for instance, occurs in 19%-25% of patients and may result in the inappropriate administration of thrombolytic therapy, which can worsen patient outcomes[6]. In this study, both ChatGPT o1 and DeepSeek-R1 correctly identified cases of pericarditis when patients exhibited classic features at presentation. However, their accuracy declined in atypical presentations or in cases complicated by comorbid conditions, underscoring their reliance on prototypical patterns. This aligns with a study showing that a deep learning model for pericarditis misinterpreted 10.2% of ST-elevation myocardial infarction (STEMI) ECGs as acute pericarditis, despite outperforming participating human experts and algorithms based on traditional ECG features[2]. This reinforces the role of physicians in recognizing atypical scenarios, which these models do not yet detect reliably.
Building on their current advancements, future studies should focus on refining these LLMs to enhance their utility in assisting complex medical diagnoses, such as acute pericarditis. Further studies should also incorporate stratified analyses that explicitly examine how demographic and clinical heterogeneity influence the diagnostic outputs of LLMs, including evaluating whether variables such as age-related physiological changes, gender-specific symptomatology, and multimorbidity profiles systematically affect model performance in risk stratification. Additionally, efforts should be made to curate and integrate more representative datasets that encompass atypical and underrepresented clinical scenarios. Such methodological rigor is essential for the development of models that are not only diagnostically accurate but also generalizable, equitable, and clinically trustworthy across diverse patient populations.
Although this study provides important insights, there are multiple limitations to consider and address in future studies. Because of the small number of case reports, the generalizability of this study’s results remains limited and could be enhanced by including a larger number of cases. Additionally, including only studies written in English introduces selection bias and further limits the generalizability of our findings to non-English-speaking populations. In future studies, this could be mitigated by incorporating multilingual review teams and leveraging translation services to include studies published in other languages.
Another limitation is the lack of case-by-case analysis, which overlooks the complexity of pericarditis and the diagnostic rationale employed by the models. A deeper qualitative evaluation of model reasoning is essential in future work. Furthermore, future research should explore the risk stratification capabilities of LLMs in stratified patient populations, considering factors such as demographics and comorbidities, to better assess their clinical relevance and potential for integration into real-world healthcare settings.
In conclusion, this study demonstrates that LLMs show potential as companion tools in the risk stratification of conditions such as pericarditis, where delayed diagnosis may lead to significant morbidity. Nevertheless, given the heterogeneity in patient presentation and condition-specific diagnostic criteria, future studies must explore the use of LLMs and their efficacy across a broader spectrum of patients and diseases.
1. Diniz-Freitas M, Diz-Dios P. DeepSeek: Another step forward in the diagnosis of oral lesions. J Dent Sci. 2025;20:1904-1907.
2. Liu YL, Lin CS, Cheng CC, Lin C. A Deep Learning Algorithm for Detecting Acute Pericarditis by Electrocardiogram. J Pers Med. 2022;12:1150.
3. Persaud S, Singh B, Angelo D. An Atypical Etiology of Acute Pericarditis: A Case Report. Cureus. 2021;13:e13440.
4. Kayaalp ME, Prill R, Sezgin EA, Cong T, Królikowska A, Hirschmann MT. DeepSeek versus ChatGPT: Multimodal artificial intelligence revolutionizing scientific discovery. From language editing to autonomous content generation-Redefining innovation in research and practice. Knee Surg Sports Traumatol Arthrosc. 2025;33:1553-1556.
5. Chowdhury MN, Haque A, Ahmed I. DeepSeek vs. ChatGPT: A Comparative Analysis of Performance, Efficiency, and Ethical AI Considerations. TechRxiv. 2025.
6. Khandaker MH, Espinosa RE, Nishimura RA, Sinak LJ, Hayes SN, Melduni RM, Oh JK. Pericardial disease: diagnosis and management. Mayo Clin Proc. 2010;85:572-593.