Published online Aug 26, 2025. doi: 10.4330/wjc.v17.i8.110489
Revised: June 13, 2025
Accepted: July 31, 2025
Processing time: 74 days and 22.5 hours
The integration of sophisticated large language models (LLMs) into healthcare has recently garnered significant attention due to their ability to leverage deep learning techniques to process vast datasets and generate contextually accurate, human-like responses. These models have previously been applied in medical diagnostics, such as the evaluation of oral lesions. Given the high rate of missed diagnoses in pericarditis, LLMs may support clinicians in generating differential diagnoses, particularly in atypical cases where risk stratification and early identification are critical to preventing serious complications such as constrictive pericarditis and pericardial tamponade.
To compare the accuracy of LLMs as risk stratification tools in assisting the diagnosis of pericarditis.
A PubMed search was conducted using the keyword “pericarditis”, with a filter for case reports. Data from relevant cases were extracted. Inclusion criteria consisted of English-language reports involving patients aged 18 years or older with a confirmed diagnosis of acute pericarditis. The diagnostic capabilities of ChatGPT o1 and DeepSeek-R1 were assessed by evaluating whether pericarditis appeared among the top three differential diagnoses and whether it was given as the sole provisional diagnosis. Each case was classified as either “yes” or “no” on each criterion.
From the initial search, 220 studies were identified, of which 16 case reports met the inclusion criteria. In assessing risk stratification for acute pericarditis, ChatGPT o1 included the condition in the top three differential diagnoses in 10 of 16 cases (62.5%) and named it as the provisional diagnosis in 8 of 16 cases (50.0%). DeepSeek-R1 did so in 8 of 16 cases (50.0%) and 6 of 16 cases (37.5%), respectively. ChatGPT o1 therefore demonstrated higher accuracy than DeepSeek-R1 in identifying pericarditis.
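The per-criterion accuracies above reduce to simple proportions over the 16 binary (“yes”/“no”) case classifications. A minimal sketch of that tally is shown below; the `CaseResult` structure and the assumption that every provisional “yes” also appeared in the differential are illustrative, not taken from the study's data.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    in_differential: bool  # pericarditis among the model's top-3 differential diagnoses
    provisional: bool      # pericarditis given as the sole provisional diagnosis

def accuracy(results: list[CaseResult], criterion: str) -> float:
    """Proportion of cases scored 'yes' on the given criterion."""
    hits = sum(1 for r in results if getattr(r, criterion))
    return hits / len(results)

# Hypothetical tallies consistent with the reported counts (16 cases each);
# the exact overlap between the two criteria is assumed, not reported.
chatgpt_o1 = ([CaseResult(True, True)] * 8   # differential 10/16, provisional 8/16
              + [CaseResult(True, False)] * 2
              + [CaseResult(False, False)] * 6)
deepseek_r1 = ([CaseResult(True, True)] * 6  # differential 8/16, provisional 6/16
               + [CaseResult(True, False)] * 2
               + [CaseResult(False, False)] * 8)

print(accuracy(chatgpt_o1, "in_differential"))  # 0.625
print(accuracy(chatgpt_o1, "provisional"))      # 0.5
print(accuracy(deepseek_r1, "in_differential")) # 0.5
print(accuracy(deepseek_r1, "provisional"))     # 0.375
```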
Further research with larger sample sizes and optimized prompt engineering is warranted to improve diagnostic accuracy, particularly in atypical presentations.
Core Tip: This study evaluates the capabilities of two large language models (LLMs), ChatGPT o1 and DeepSeek-R1, in the risk stratification of acute pericarditis, where delayed diagnosis may lead to significant complications. While both LLMs show promise as supportive tools in identifying high-risk presentations, their current limitations in recognizing atypical symptom profiles underscore the need for further refinement. Future research should focus on improving model sensitivity to demographic and clinical variability to ensure broader applicability and safety in real-world settings.