Evaluation of Closed and Open Large Language Models in Pediatric Cardiology Board Exam Performance

Abstract

Introduction

Large language models (LLMs) have gained traction in medicine, but there is limited research comparing closed- and open-source models in subspecialty contexts. This study evaluated ChatGPT-4.0o and DeepSeek-R1 on a pediatric cardiology board-style examination to quantify their accuracy and assess their clinical and educational utility.

Methods

ChatGPT-4.0o and DeepSeek-R1 each answered 88 text-based multiple-choice questions spanning 11 pediatric cardiology subtopics from a Pediatric Cardiology Board Review textbook. DeepSeek-R1's processing time per question was recorded. Model accuracy was compared using an unpaired two-tailed t-test, and bivariate correlations were assessed using Pearson's r.
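A minimal sketch of the accuracy comparison described above, assuming per-question correctness is coded as 1 (correct) or 0 (incorrect); this is an illustration reconstructed from the reported totals, not the authors' analysis code.

```python
import numpy as np
from scipy import stats

# Per-question correctness vectors reconstructed from the reported totals
# (62/88 and 60/88 correct); question order does not affect an unpaired t-test.
chatgpt = np.array([1] * 62 + [0] * (88 - 62))
deepseek = np.array([1] * 60 + [0] * (88 - 60))

# Unpaired two-tailed t-test comparing overall accuracy between the two models.
t_stat, p_val = stats.ttest_ind(chatgpt, deepseek)
print(f"t = {t_stat:.2f}, p = {p_val:.2f}")
```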

Results

ChatGPT-4.0o and DeepSeek-R1 achieved 70% (62/88) and 68% (60/88) accuracy, respectively (p = 0.79). The models scored equally in 5 of 11 subtopic chapters, and each outperformed the other in 3 of 11. DeepSeek-R1's per-question processing time was negatively correlated with accuracy (r = –0.68, p = 0.02).
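The time–accuracy relationship can be computed at the subtopic level with Pearson's r; the sketch below uses placeholder values for the 11 subtopics (not the study's data) solely to show the form of the calculation.

```python
from scipy import stats

# Placeholder subtopic-level values for 11 chapters; NOT the study's actual data.
mean_time_s = [12.0, 18.5, 25.0, 14.2, 30.1, 22.3, 16.8, 28.7, 20.4, 35.6, 19.9]
accuracy    = [0.88, 0.75, 0.50, 0.88, 0.38, 0.62, 0.75, 0.50, 0.62, 0.38, 0.75]

# Pearson correlation between DeepSeek-R1 processing time and accuracy across subtopics.
r, p = stats.pearsonr(mean_time_s, accuracy)
print(f"r = {r:.2f}, p = {p:.2f}")
```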

Conclusion

ChatGPT-4.0o and DeepSeek-R1 approached the passing threshold on a pediatric cardiology board-style examination with comparable accuracy. These findings suggest that open-source models have the potential to enhance clinical and educational outcomes while supporting sustainable AI development.
