Benchmarking Large Language Models for Replication of Guideline-Based PGx Recommendations


Abstract

We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene–drug–phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, through both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, while our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference speed. Results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
